Long sentence shortener to generate a readable file name - node.js

I'm looking for a way to shorten a sentence (a text of a few lines) to produce a "readable" (not too long) file name.
The application scenario is a chatbot where a user can submit a media file, say a video, with some paired description text (a caption). The application would assign the video a readable file name, so the video can be retrieved afterward by its file name.
Imagine a video paired with a more or less long text description of the scene, for example:
const videoDescription = 'beautiful yellow flowers on foreground, with a background with countryside meadows and many cows'
How could I shorten the description above with a "suitable" short file name?
Ok, I could just give the sentence as a name, maybe something a bit sanitized, like:
const videoFileName = 'beautiful_yellow_flowers_on_foreground_with_a_background_with_countryside_meadows_and_many_cows.MP4'
but that way I could exceed the 255-character file name limit (e.g. on Linux).
Any idea for a shortening algorithm?
Maybe I could build the shortened file name from word abbreviations?
Maybe I could remove articles, prepositions, etc. from the sentence?
BTW, a minor issue: I'm working with the Italian language, so a bit of character sanitizing is required to produce good file names.
Last but not least, I'm looking for JavaScript/Node.js code.

You can check if the length is larger than 255 and shorten it if necessary. You should also check for duplicates and append -1, -2 and so on as needed.
let filename = 'some_flowers_on_foreground_with_a_background_with_countryside_meadows_and_few_cows.MP4'
if (filename.length > 255) {
  // truncate, leaving room for the 4-character '.MP4' extension
  filename = filename.slice(0, 255 - 4) + '.MP4'
}
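Building on that, here is a minimal sketch of the stop-word and sanitizing ideas from the question: strip accents (useful for Italian text), drop common stop words, join with underscores, and hard-truncate to respect the filesystem limit. The stop-word list is illustrative and incomplete, and makeFileName is a made-up name:

const STOP_WORDS = new Set(['a', 'an', 'the', 'on', 'with', 'and', 'many',
  'il', 'lo', 'la', 'le', 'un', 'una', 'di', 'con', 'e', 'su']); // illustrative, far from exhaustive

function makeFileName(description, ext = '.MP4', maxLength = 255) {
  const words = description
    .normalize('NFD').replace(/[\u0300-\u036f]/g, '') // strip accents: è -> e
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')                     // drop anything else non-alphanumeric
    .split(/\s+/)
    .filter(w => w && !STOP_WORDS.has(w));            // remove stop words and empty tokens
  const base = words.join('_').slice(0, maxLength - ext.length);
  return base + ext;
}

console.log(makeFileName('beautiful yellow flowers on foreground, with a background with countryside meadows and many cows'));
// -> beautiful_yellow_flowers_foreground_background_countryside_meadows_cows.MP4

If two descriptions collapse to the same name, you could then test it with fs.existsSync() and append -1, -2 and so on before the extension, as suggested above.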

Related

batch file extract numbers from text file with little information

So this is related to my other two posts. I'm dealing with extracting text from a text file and analyzing it, and I've run into some problems. For a while I've been using a method that sets all the text between two other strings as a variable, but here is the situation I have. I need to extract the speed (the numbers) from the string below: "etc...,query":{"ping":47855},"cmts":...etc. The problem is that the text cmts sometimes changes to something else, so really I need to extract all the numbers from this:
,query":{"ping":47855},"
One more thing that makes this difficult: the characters }," are all over the file. Thank you for helping me! -Lucas EDG Programmer.
Here's the full file:
{"_id":53291,"ip":"158.69.22.95","domain":"jectile.com","port":25565,"url":"","date_add":1453897770,"status":1,"scan":1,"uptime":99.53,"last_update":1485436105,"geo":{"country":"US","country_name":"United States","city":"Lake Forest"},"info":{"name":" Jectile | jectile.com [1.8-1.11]\n Shoota (Call of Duty) \/ Zambies (Zombie Survival)","type":"FML","version":"1.10","plugins":[],"players":18,"max_players":420,"players_list":[],"map":"world","software":"BungeeCord 1.8.x, 1.9.x, 1.10.x, 1.11.x","avg_player_day":24.458333,"avg_load_day":5.8234,"platform":"MINECRAFT","icon":true},"counter":{"online":47871,"offline":228,"players":{"date":"2017-01-26","total":0},"last_offline":0,"query":{"ping":47855},"cmts":1},"rating":{"main":19.24,"difference":-0.64,"content_up":0.15,"K":0},"last":{"offline":1485415702,"online":1485436105},"chart":{"14:30":14,"14:40":16,"14:50":15,"15:00":18,"15:10":12,"15:20":13,"15:30":9,"15:40":9,"15:50":11,"16:00":12,"16:10":11,"16:20":11,"16:30":18,"16:40":25,"16:50":23,"17:00":27,"17:10":27,"17:20":23,"17:30":24,"17:40":26,"17:50":33,"18:00":31,"18:10":31,"18:20":32,"18:30":37,"18:40":38,"18:50":39,"19:00":38,"19:10":34,"19:20":33,"19:30":40,"19:40":36,"19:50":37,"20:00":38,"20:10":36,"20:20":38,"20:30":37,"20:40":37,"20:50":37,"21:00":34,"21:10":32,"21:20":33,"21:30":33,"21:40":29,"21:50":28,"22:00":26,"22:10":21,"22:20":24,"22:30":29,"22:40":22,"22:50":23,"23:00":27,"23:10":24,"23:20":26,"23:30":25,"23:40":28,"23:50":27,"00:00":32,"00:10":29,"00:20":33,"00:30":32,"00:40":31,"00:50":33,"01:00":40,"01:10":40,"01:20":40,"01:30":41,"01:40":45,"01:50":48,"02:00":43,"02:10":45,"02:20":46,"02:30":46,"02:40":43,"02:50":42,"03:00":39,"03:10":36,"03:20":44,"03:30":34,"03:40":0,"03:50":32,"04:00":35,"04:10":35,"04:20":33,"04:30":43,"04:40":37,"04:50":26,"05:00":31,"05:10":31,"05:20":27,"05:30":25,"05:40":26,"05:50":18,"06:00":13,"06:10":15,"06:20":17,"06:30":18,"06:40":17,"06:50":15,"07:00":16,"07:10":17,"07:20":16,"07:30":16,"07:40":18,"07:50":19,"08:00":14,"08:10":12,"08:20":12,"08:30":13,"08:40":17,"08:50":20,"09:00":18,"09:10":0,"09:20":0,"09:30":27,"09:40":18,"09:50":20,"10:00":15,"10:10":13,"10:20":12,"10:30":10,"10:40":10,"10:50":11,"11:00":13,"11:10":13,"11:20":16,"11:30":19,"11:40":17,"11:50":13,"12:00":10,"12:10":11,"12:20":12,"12:30":16,"12:40":15,"12:50":16,"13:00":14,"13:10":10,"13:20":13,"13:30":16,"13:40":16,"13:50":17,"14:00":20,"14:10":16,"14:20":16},"query":"ping","max_stat":{"max_online":{"date":1470764061,"players":129}},"status_query":"ok"}
By the way, the reason things change is that it pulls info from different servers.
Very similar to the answer I gave to your first question:
@Echo Off
Rem Read the single line of JSON into var
Set /P var=<some.json
Rem Strip everything up to and including :{"ping":
Set var=%var:*:{"ping":=%
Rem Replace }, with &: so that, when echoed, everything after the number is treated as a label and ignored
Set var=%var:},=&:%
Echo=%var%
Timeout -1
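Since the file is actually valid JSON, a scripting runtime makes this trivial if one is available. A minimal Node.js sketch, assuming the data sits in some.json:

const fs = require('fs');
const server = JSON.parse(fs.readFileSync('some.json', 'utf8'));
console.log(server.counter.query.ping); // -> 47855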

Labelling text using Notepad++ or any other tool

I have several .dat files containing information about hotel reviews, as below:
/*
<Author> simmotours
<Content> review......goes here
<Date>Nov 18, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>4
<Value>4
<Rooms>3
<Location>4
<Cleanliness>4
<Check in / front desk>4
<Service>4
<Business service>-1
*/
I want to classify the reviews into two classes, pos and neg, i.e. have two folders, pos and neg, containing the files with reviews scored above 3 (positive) and below 3 (negative).
How can I quickly and efficiently automate this process?
You could write up a Python script to read the Overall score. Do this by looping over the lines using readline() (see here). Find the Overall score using some string parsing, then move the file into the right directory. These are all very simple things to do in Python; just break the task down into steps and search for answers to those steps.
Notepad++ can do replacements with regular expressions, and it allows the definition of macros. Use them to convert the file to an XML file; check out the help file.
Then you can read it with any scripting language and do what you want.
Alternatively you could change the file to a form where you can load it into Excel and do the analysis there.
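As a sketch of the script route (any scripting language works; this one is Node.js, and it assumes one review per .dat file sitting in a ./reviews folder, with files scoring exactly 3 left in place):

const fs = require('fs');
const path = require('path');

const srcDir = './reviews'; // assumed input folder
for (const name of fs.readdirSync(srcDir).filter(f => f.endsWith('.dat'))) {
  const text = fs.readFileSync(path.join(srcDir, name), 'utf8');
  const match = text.match(/<Overall>(-?\d+)/); // pull the Overall score out of the tagged block
  if (!match) continue;
  const score = parseInt(match[1], 10);
  if (score === 3) continue; // neither positive nor negative
  const destDir = score > 3 ? './pos' : './neg';
  fs.mkdirSync(destDir, { recursive: true });
  fs.renameSync(path.join(srcDir, name), path.join(destDir, name));
}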

Allocating matrix / structure data instead of string name to variable

I have a script that opens a folder and does some processing on the data present. Say, there's a file "XYZ.tif".
Inside this tif file, there are two groups of datasets, which show up in the workspace as
data.ch1eXYZ
and
data.ch3eXYZ
If I want to continue with the 2nd set, I can use
A=data.ch3eXYZ
However, XYZ usually is much longer and varies per file, whereas data.ch3e is consistent.
Therefore I tried
A=strcat('data.ch3e','origfilename');
where origfilename of course is XYZ, which has (automatically) been extracted before.
However, that gives me a string A (since I practically typed
A='data.ch3eXYZ'
instead of the matrix that data.ch3eXYZ actually is).
I think it's just a problem with ()'s, []'s, or {}'s, but I can't seem to figure it out.
Thanks in advance!
If you know the string, dynamic field references should help you here; they are far better than eval.
Slightly modified example from the linked blog post:
fldnm = 'fred';
s.fred = 18;
y = s.(fldnm)
Returns:
y =
18
So for your case:
test = data.(['ch3e' origfilename]);
That should be sufficient.
Edit: Link to the documentation

Python - CSV Module, Getting Information From a File

Here is the situation:
The first problem I'm having is with obtaining information from a CSV file. The purpose of the code I'm writing is to get a bunch of information on ZCTAs (zip codes) for a number of different cohorts (six are currently being used, but the code is meant to be flexible enough to handle any number of cohorts). One file contains the population, by cohort, for each ZCTA. Another file has the number of 'cases' (cases of cancer observed) for each cohort, for each ZCTA. Another file has the crude rate for each cohort for the state of Iowa (the focus of this research), i.e. the rate at which one can 'expect' to see cancer cases in a population, by cohort. There are a couple of other files, but these are the focus, as this is where my issue is exhibited.
What my code does, initially, is read the population file and get the population of each cohort by ZCTA. Each ZCTA and its information is stored in a list, which is then stored in a (nested) list of lists containing all of the ZCTAs. The code then gets the crude rate. Then the crude rate is multiplied by the appropriate cohort population, for each ZCTA, and summed with all of the other cohorts within each ZCTA, to get the total number of people we can EXPECT to see having cancer, for each ZCTA. The population is also summed up. This information is stored in another list, as well as a list containing all of the ZCTAs. This information will be the focus (the list of all of the ZCTAs, each containing the total population and the total number of expected cases).
So, the problem is that I then need to take this newly acquired list, get the number of OBSERVED cases for each cohort, sum those together, append the sum to the appropriate ZCTA, and write it to a new file. I have code implemented that does this fine, EXCEPT that the bottom 22 or so ZCTAs don't get the number of observed cases. I don't know if it is the code or what, but it works for all of the other 906 ZCTAs and just misses the bottom 22.
The reader will find sample data for the files I've discussed (the observed case file, and the output file) at: Gist
Here is the code I'm using:
import csv

expectedcsv = open('ExpectedCases.csv', 'w', newline='')
expectedwriter = csv.writer(expectedcsv, delimiter=',')
expectedHeader = ['zcta', 'expected', 'pop', 'observed']
for zcta in zctaPop:
    caseCounter = 0
    thecasescsv = open('NewCaseFile.csv', 'r', newline='')
    thecasesreader = csv.reader(thecasescsv, delimiter=',')
    for case in thecasesreader:
        if case[0] == zcta[0]:
            for i in range(3, len(case)):
                caseCounter += int(case[i])
    zcta.append(caseCounter)
    expectedwriter.writerow(zcta)
expectedcsv.close()
thecasescsv.close()
Something else I would also like to bring up is that later on in the code, the actual purpose of all of this is to create an SMR filter for each grid point. The grid points are somewhat arbitrary; they have been placed (via coordinates) over the entire state of Iowa. The SMR is the number of observed cases divided by the number of expected cases. The threshold, that is, how many expected cases a particular filter needs, is set by the user. So, if a user wants a filter built on 150 expected cases (for each grid point), the code goes through the ZCTAs, summing up the expected cases until more than 150 are found. The distance to this last ZCTA is the 'radius' of the filter.
To do this, I built a distance matrix (the distance from each grid point to every ZCTA) and then sorted it, nearest to furthest. Because of the size of the file (2300 x 930), I have to read it line by line and get all of the information from other files. So, starting with the nearest ZCTA, I get the population, expected cases, and observed cases (the problem with this file was discussed above) and add each of these to its respective counter (one for population, one for observed, and one for expected). Then it does the same for the next closest ZCTA, until the threshold is exceeded.
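The threshold scan itself is small in any language. Here is a sketch of that accumulation loop (shown in Node.js for brevity; sortedByDistance and the record fields are illustrative assumptions standing in for one grid point's nearest-first ZCTA list):

// sortedByDistance: one grid point's ZCTA records, nearest first (assumed precomputed)
function smrFilter(sortedByDistance, threshold) {
  let expected = 0, observed = 0, pop = 0, radius = 0;
  for (const z of sortedByDistance) {
    expected += z.expected;
    observed += z.observed;
    pop += z.pop;
    radius = z.distance; // the last ZCTA pulled in sets the filter radius
    if (expected > threshold) break; // stop once enough expected cases are accumulated
  }
  return { smr: observed / expected, radius, pop };
}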
The problem here is that I couldn't use the CSV module to read these files, as I was already reading from another file and the index would be lost. So I had to use just the regular file read(), which then required some interesting use of maketrans and .translate. I'm not sure it's efficient or works great. Everything seems to be fine, but without the above problem being fixed, it's impossible to tell. I have included the code below, but I was wondering if anybody had any better ideas or suggestions?
expectedCSV = open('ExpectedCases.csv', 'r', newline='')
table = str.maketrans('\r', ' ')
content = expectedCSV.read()
expectedCSV.close()
content = content.translate(table)
content = content.split(sep='\n')
newContent = []
for item in content:
    newContent.append(item.split(sep=','))
content = ' '
for item in newContent:
    if item[0] == currentZcta:
        expectedTotal += float(item[1])
        totalPop += float(item[2])
        totalObservedCount += float(item[3])
Also, I couldn't figure out how to color the methods blue and the variables red, as some of the more awesome users of this site do. I would be very much interested in learning how to do that for future posts.
If anybody needs more info or anything clarified to help answer/formulate a solution, please, by all means, ask! Thanks for taking the time to read!
So, I ended up "solving" this by computing the observed along with the expected and population, by opening the file for each ZCTA computed. This did not really solve the issue I was dealing with, but rather found a way around it. I'm somewhat disappointed that more people didn't view and/or respond to this. If someone comes up with an answer to the actual problem, by all means, post it here. -Mike

Start loop at specific line of text file in groovy

I am using Groovy and I am trying to alter a text file at a specific line, without looping through all of the previous lines. Is there a way to state the line of a text file that you wish to alter?
For instance
Text file is:
1
2
3
4
5
6
I would like to say
Line(3) = p
and have it change the text file to:
1
2
p
4
5
6
I DO NOT want to have to loop through the lines to change the value, i.e. I do not want to use an .eachLine { line -> ... } method.
Thank you in advance, I really appreciate it!
I don't think you can skip lines and traverse the file like this. You could do the skip by using Java's RandomAccessFile, but you would be specifying a byte offset rather than a line number.
Try using readLines() on the file's text. It will store all your lines in a list. To change the content at line n, change the content at index n-1 of the list, and then join the list items.
Something like this will do
//We can call this the DefaultFileHandler
file = new File('test.txt') // assumed: the file being edited, kept in the script binding so line() can see it
lineNumberToModify = 3
textToInsert = "p"
line(lineNumberToModify, textToInsert)

def line(num, text) {
    def list = file.readLines()
    list[num - 1] = text // lines are 1-based, the list index is 0-based
    file.setText(list.join("\n"))
}
EDIT: For extremely large files, it is better to have a custom implementation, maybe something along the lines of what Tim Yates suggested in the comment on your question.
The above readLines() can easily process up to 100,000 lines of text in under a second, so you can do something like this:
if (file size < 10 MB)
    use DefaultFileHandler()
else
    use CustomFileHandler()

//CustomFileHandler
- Split the large file into buckets of acceptable size.
- Ex: bucket 1 (lines 1-100000), bucket 2 (lines 100001-200000), etc.
- If lineNumberToModify falls in a bucket's range, modify the line within that bucket.
There is no hard and fast rule for how you implement your CustomFileHandler, as it completely depends on the usage scenario. If you need to perform the above operation multiple times on the same file, you can do the complete bucket split first, keep the buckets in memory, and reuse them for the following operations. If it is a one-time operation, you can avoid splitting everything up front and deal only with the bucket you need, processing the others later on demand.
And even within the buckets you can add your own intelligence to speed up the job. Say you want to modify line 99999 of a bucket holding lines 1-100000; you can exploit Groovy's list methods, for example negative indices that count from the tail:
def list = file.readLines()
list[-2] = "some text" // second-to-last line via a negative index
file.setText(list.join("\n"))
