I have been searching around but have not been able to find a script that performs the overall tasks below:
1) go through all text files in a folder
2) remove duplicate lines/rows from each text file (the text is already sorted, so the sorting step can be skipped)
3) save and overwrite the text files
Unfortunately, all the results I found only remove duplicate lines from one specific file and save the output under another file name.
I will then set up a scheduled task to run this script.
I don't have any scripting knowledge, only a little experience with batch scripts. Your help and guidance would be much appreciated.
Unfortunately, all the results I found only remove duplicate lines from one specific file and save the output under another file name.
I think you have your answer right here. I don't know which language you're writing in, but typically in this scenario I would do something like this:
Open file A
Read lines
Sort lines
Remove duplicate lines
Save as file B
Close file A
Rename file A to _backup or _original (not strictly necessary, but a good safeguard against data loss)
Rename file B to file A
Again, I don't know which language you're writing in, etc.; there really isn't enough detail here to answer the question any further.
The key point, though, is simply to delete your original file and rename your new file to the original name.
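To illustrate the idea in one concrete language, here is a minimal sketch of that pattern in Go (the file name and the .tmp suffix are just placeholders, and it assumes the input is already sorted so duplicate lines sit next to each other):
package main

import (
    "bufio"
    "log"
    "os"
)

// dedupeFile rewrites path without adjacent duplicate lines by writing to a
// temporary file first and then renaming it over the original.
func dedupeFile(path string) error {
    in, err := os.Open(path)
    if err != nil {
        return err
    }
    tmp, err := os.Create(path + ".tmp")
    if err != nil {
        in.Close()
        return err
    }

    w := bufio.NewWriter(tmp)
    scanner := bufio.NewScanner(in)
    prev, first := "", true
    for scanner.Scan() {
        line := scanner.Text()
        // the input is already sorted, so duplicates sit next to each other
        if first || line != prev {
            w.WriteString(line + "\n")
        }
        prev, first = line, false
    }
    w.Flush()
    in.Close()
    tmp.Close()
    if err := scanner.Err(); err != nil {
        return err
    }
    // swap the de-duplicated copy in place of the original
    return os.Rename(path+".tmp", path)
}

func main() {
    if err := dedupeFile("example.txt"); err != nil { // placeholder file name
        log.Fatal(err)
    }
}
Writing the result to a temporary file and then renaming it over the original means the original is never left half-written if something fails part-way through.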
I wrote and commented a little script in Go for you. It might help in your case if you know how to run it; if not, a quick search will show you how.
package main

import (
    "io/ioutil"
    "log"
    "os"
    "strings"
)

func main() {
    // get all files in the current directory
    files, err := ioutil.ReadDir(".")
    if err != nil {
        log.Fatal(err)
    }
    // go through all the files
    for _, file := range files {
        // only handle .txt files (you can change this)
        if strings.HasSuffix(file.Name(), ".txt") {
            // read the whole file
            data, err := ioutil.ReadFile(file.Name())
            if err != nil {
                log.Println(err)
                continue
            }
            // split the contents on a space; you can change the separator
            lines := strings.Split(string(data), " ")
            // remove the duplicates from the lines slice (func below)
            RemoveDuplicates(&lines)
            // recreate the file, truncating the old contents
            f, err := os.Create(file.Name())
            if err != nil {
                log.Println(err)
                continue
            }
            // write the de-duplicated values back
            for e := range lines {
                f.Write([]byte(lines[e] + " ")) // writes a space after each value; change as needed
            }
            // close the file
            f.Close()
        }
    }
}

func RemoveDuplicates(lines *[]string) {
    found := make(map[string]bool)
    j := 0
    for i, x := range *lines {
        if !found[x] {
            found[x] = true
            (*lines)[j] = (*lines)[i]
            j++
        }
    }
    *lines = (*lines)[:j]
}
Your file: hello hello yes no
Returned result: hello yes no
If you run this program in the directory with all your files, it will remove the duplicates.
Hope it fits your needs.
I am trying to process the strings present in a particular file. The file is written in English. The problem arises when the encoding of the file differs from "UTF-8": a file encoded as "UTF-16-LE" does not behave as expected. My main goal is to manipulate the strings read from the file. For example, strings.TrimSpace() only works with the UTF-8 file.
I am aware that Go only supports UTF-8 by default; any alternative approach would be helpful.
Personal Question
I would also like to point out that many newer programming languages process strings irrespective of the encoding, so why does Go only support UTF-8? If there were at least an alternative way to pass the encoding format to the reader, that would still help.
What I tried
I tried using the utf8 and utf16 standard packages.
Code
(main.go)
Sample code to show the difference:
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "strings"
)

func processFile(src string) {
    data, _ := ioutil.ReadFile(src)
    fmt.Println("--- original source ---")
    fmt.Println(string(data))
    fmt.Println(http.DetectContentType(data))
    fmt.Println("\n--- modified source ---")
    for _, val := range strings.Split(string(data), "\n") {
        fmt.Println(strings.TrimSpace(val))
    }
}

func main() {
    processFile("./utf-16-english.txt")
    processFile("./utf-8-english.txt")
}
File-1
(utf-8-english.txt)
Hello
This is
Sample
Document
File-2
(utf-16-english.txt)
Hello
This is
Sample
Document
EDIT
It seems that the only way to process the strings properly is to convert them to UTF-8; kindly refer to the marked answer.
As per the comments, I have written the result of the program to the respective files. The special symbols are not present, but the string processing only works fine with UTF-8.
You have to decode the UTF-16-encoded file. The decoding will convert the input to UTF-8, after which you can use the string libraries to process the input.
You can use something like this:
import (
    "fmt"
    "io/ioutil"
    "strings"
    "unicode/utf16"
)

func processFile(src string, decode func(in []byte) string) {
    data, _ := ioutil.ReadFile(src)
    fmt.Println("--- original source ---")
    fmt.Println(decode(data))
    fmt.Println("\n--- modified source ---")
    for _, val := range strings.Split(decode(data), "\n") {
        fmt.Println(strings.TrimSpace(val))
    }
}

// decodeUTF16LE pairs the little-endian bytes into UTF-16 code units,
// decodes them to runes, and strips a leading BOM if present.
func decodeUTF16LE(in []byte) string {
    u16 := make([]uint16, 0, len(in)/2)
    for i := 0; i+1 < len(in); i += 2 {
        u16 = append(u16, uint16(in[i])|uint16(in[i+1])<<8)
    }
    return strings.TrimPrefix(string(utf16.Decode(u16)), "\uFEFF")
}

func main() {
    processFile("./utf-16-english.txt", decodeUTF16LE)
    processFile("./utf-8-english.txt", func(in []byte) string { return string(in) })
}
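As an alternative, if you can take a dependency outside the standard library, the golang.org/x/text module provides UTF-16 decoders that can wrap any io.Reader. A minimal sketch along those lines (the path is a placeholder; the file is assumed to be UTF-16 little-endian, and a BOM is honoured if present):
package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "os"
    "strings"

    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

func main() {
    f, err := os.Open("./utf-16-english.txt") // placeholder path
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // wrap the file in a transforming reader that decodes UTF-16 LE
    // (using the BOM if one is present) into UTF-8
    dec := unicode.UTF16(unicode.LittleEndian, unicode.UseBOM).NewDecoder()
    data, err := ioutil.ReadAll(transform.NewReader(f, dec))
    if err != nil {
        log.Fatal(err)
    }

    // from here on the usual string functions behave as expected
    for _, val := range strings.Split(string(data), "\n") {
        fmt.Println(strings.TrimSpace(val))
    }
}
This keeps the byte pairing and BOM handling out of your own code, at the cost of an extra dependency.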
Fair warning, I'm rather inexperienced with programming in general.
When generating outputs from markdown documents, my goal is to replace placeholders (e.g. '{{string}}') with text from an external file with the same name (i.e. 'string.md'). I've been able to achieve replacing the string with the first line of text of an .md file.
However, I'm struggling to find a way to replace a string with multiple lines of text (essentially an include function).
The code for the first task was based on some examples from the Pandoc manual:
function file_exists(file)
    local f = io.open(file, "rb")
    if f then f:close() end
    return f ~= nil
end

function lines_from(file)
    if not file_exists(file) then return {} end
    lines = {}
    for line in io.lines(file) do
        lines[#lines + 1] = line
    end
    return lines
end

return {
    {
        Str = function (elem)
            if elem.text:match ("{{.+}}") then
                file_name = elem.text:match ("{{.+}}")
                file_name = file_name:gsub("{{","")
                file_name = file_name:gsub("}}","")
                local file = 'lua-filters/'..file_name..'.md'
                if not file_exists(file) then
                    return elem
                else
                    local lines = lines_from(file)
                    return pandoc.Str(elem.text:gsub("{{.+}}",lines[1]))
                end
            else
                return elem
            end
        end,
    }
}
Every time a placeholder '{{string}}' is found, it will be replaced with the first line of the corresponding file 'string.md'. An example:
{{project}} is the project name.
The project number is {{projectno}}.
turns into:
Project Name is the project name.
The project number is 456321.
With a second Lua filter, I want to be able to use a placeholder string where, instead of only the first line, the full content of the text file is returned. However, I've been unable either to find a way to return all lines of the corresponding file (using the same code as above) or to convert the content of the file into a more appropriate element.
The desired outcome would be:
## Include section
}}lorem-ipsum{{
Turns into:
## Include section
Lorem ipsum dolor sit amet....
Ut enim ad minim veniam...
With "Lorem ipsum..." being the context of the file 'lorem-ipsum.md'. Using pairs I was only able to return the first line:
if not file_exists(file) then
    return elem
else
    local lines = lines_from(file)
    for k,v in pairs(lines) do
        return pandoc.Str(v)
    end
Another approach might be to use Para and elem.content[1].text instead of Str and elem.text, but I've been unable to figure out if/how the file should be handled differently.
Any help would be greatly appreciated.
I have an assignment in my computation for geophysicists course; the task is basically finding the largest value in a column of a .txt file (the value is the magnitude of the earthquake, and the file contains all earthquakes from 1990-2000), then taking the latitude and longitude of such earthquake(s) and plotting them on a map.
The task is quite simple to do in Python, but since I am devoting some free time to studying web development, I am thinking of making a simple web app that would do the complete task.
In other words, I would upload a file into it, and it would automatically plot the biggest earthquakes on a Google map.
But since I am kind of a noob in Node.js I am having a hard time starting the project, so I am breaking it into parts, and I need help with the first part.
I am thinking of converting the .txt file with the data into a .csv file and subsequently converting it into a .json file. I have absolutely no idea what algorithm I should use to scan the .json file and find the largest value of a given column.
Here is the first row of the original .txt file:
1 5 0 7.0 274 102.000 -3.000
Here it is as a .csv file, using an online converter:
1 5 0 7.0 274 102.000 -3.000
And here it is in the .json file, again using an online converter:
1\t 5\t 0\t7.0\t274\t102.000\t -3.000
Basically I need to scan all the rows and find the largest value of the 5th column.
Any help on how I would start writing this code?
Thanks very much.
TLDR version:
Need to find the largest value of the 5th column in a JSON file with multiple rows.
I had a go at this as a one-liner, code golf style. I'll leave out the usual "don't get Stack Overflow to do your homework for you" shtick. You're only cheating yourself, kids these days, yada yada.
Split, map, reduce.
let data = require('fs').readFileSync('renan/homework/geophysicist_data.txt', 'utf8')
let biggest = data.split('\n')
    .map(line => line.split(/[ ]+/)[4])
    .reduce((a, b) => Math.max(a, b))
Having loaded up the data we process it in 3 steps.
.split('\n') By splitting on the newline character we break the text file down into an array, so that each line in the text file is converted into an item in the array.
.map(line => line.split(/[ ]+/)[4]) 'map' takes this array of lines and runs a command on every single line individually. For each line we tell it that one-or-more spaces (split(/[ ]+/)) is the column separator, and once it's been broken into columns to take the fifth column in that line (We use [4] instead of [5] because javascript starts counting from 0).
.reduce((a, b) => Math.max(a, b)) Now we have an array containing only the fifth column numbers, we can send the array directly to Math.max and let it do the hard work calculating our answer for us. Hooray!
If this data is even a little bit un-uniform it would be very easy to break this, but I'm assuming that because it's a homework assignment that's not the case.
Good luck!
If your file just contains lines of numbers with the same structure, I'd not convert it to CSV or JSON.
I'd simply go for parsing the .txt manually. Here is a code snippet showing how you could do this. I used two external modules: lodash (an ultra-popular utility library for data manipulation) and validator (which helps to validate strings):
'use strict';

const fs = require('fs');
const _ = require('lodash');
const os = require('os');
const validator = require('validator');

const parseLine = function (line) {
    if (!line) {
        throw new Error('Line not passed');
    }
    //Splitting a line into tokens
    //Some of the tokens are separated with double spaces
    //So using a regex here
    let tokens = line.split(/\s+/);
    //Data validation
    if (!(
        //I allowed more than 7 tokens per line, but not less
        tokens && tokens.length >= 7
        //Also checking that the strings contain numbers
        //So they will be parsed properly
        && validator.isDecimal(tokens[4])
        && validator.isDecimal(tokens[5])
        && validator.isDecimal(tokens[6])
    )) {
        throw new Error('Cannot parse a line');
    }
    //Parsing the values as they come in as strings
    return {
        magnitude: parseFloat(tokens[4]),
        latitude: parseFloat(tokens[5]),
        longitude: parseFloat(tokens[6])
    }
};

//I passed the encoding to readFile because if I did not,
//data would be a buffer, so we'd have to call .toString('utf8') on it.
fs.readFile('./data.txt', 'utf8', (err, data) => {
    if (err) {
        console.error(err.stack);
        throw err;
    }
    //Splitting into lines with os.EOL
    //so our code works on Linux/Windows/Mac
    let lines = data.split(os.EOL);
    if (!(lines && lines.length)) {
        console.log('No lines found.');
        return;
    }
    //Simple lodash chain to parse all lines
    //and then find the one with max magnitude
    let earthquake = _(lines)
        .map(parseLine)
        .maxBy('magnitude');
    //Simply logging it here
    //You'll probably put it into a callback/promise
    //Or send it as a response from here
    console.log(earthquake);
});
I am running a Data Lake Analytics job, and during extraction I am getting an error.
In my scripts I use the built-in Text extractor and also my own extractor. I try to get data from a file containing two columns separated by a space character. When I run my scripts locally everything works fine, but not when I try to run them using my DLA account. I have the problem only when I try to get data from files with many thousands of rows (but only 36 MB of data); for smaller files everything works correctly. I noticed that the exception is thrown when the total number of vertices is larger than the one for the extraction node. I ran into this problem earlier when working with other "big" files (.csv, .tsv) and extractors. Could someone tell me what is happening?
Error message:
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract[0][0] with error: Vertex user code error.
Vertex failed with a fail-fast error
Script code:
@result =
    EXTRACT s_date string,
            s_time string
    FROM @"/Samples/napis.txt"
    //USING USQLApplicationTest.ExtractorsFactory.getExtractor();
    USING Extractors.Text(delimiter:' ');

OUTPUT @result
TO @"/Out/Napis.log"
USING Outputters.Csv();
Code behind:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class MyExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (StreamReader sr = new StreamReader(input.BaseStream))
        {
            string line;
            // Read and emit lines from the file until the end of
            // the file is reached.
            while ((line = sr.ReadLine()) != null)
            {
                string[] words = line.Split(' ');
                int i = 0;
                foreach (var c in output.Schema)
                {
                    output.Set<object>(c.Name, words[i]);
                    i++;
                }
                yield return output.AsReadOnly();
            }
        }
    }
}

public static class ExtractorsFactory
{
    public static IExtractor getExtractor()
    {
        return new MyExtractor();
    }
}
Part of sample file:
...
str1 str2
str1 str2
str1 str2
str1 str2
str1 str2
...
In the job resources I found this jobError message:
"Unexpected number of columns in input stream."-"description":"Unexpected number of columns in input record at line 1.\nExpected 2 columns- processed 1 columns out of 1."-"resolution":"Check the input for errors or use \"silent\" switch to ignore over(under)-sized rows in the input.\nConsider that ignoring \"invalid\" rows may influence job results.
But I checked the file again and I don't see an incorrect number of columns. Is it possible that the error is caused by an incorrect file split and distribution? I read that big files can be extracted in parallel.
Sorry for my poor English.
The same question was answered here: https://social.msdn.microsoft.com/Forums/en-US/822af591-f098-4592-b903-d0dbf7aafb2d/vertex-failure-triggered-quick-job-abort-exception-thrown-during-data-extraction?forum=AzureDataLake.
Summary:
We currently have an issue with large files where the row is not aligned with the file extent boundary if you upload the file with the "wrong" tool. If you upload it as row-oriented file through Visual Studio or via the Powershell command, you should get it aligned (if the row delimiter is CR or LF). If you did not use the "right" upload tool, the built-in extractor will show the behavior that you report because it currently assumes that record boundaries are aligned to the extents that we split the file into for parallel processing. We are working on a general fix.
If you see similar error messages with your custom extractor that uses AtomicFileProcessing=true and should be immune to the split, please send me your job link so I can file an incident and have the engineering team review your case.
Rather than appending to the end of a file, I am trying to replace or append to a certain line using Groovy file manipulation.
Is there a method to do this?
append adds to the end of the file, while write overwrites the existing file:
def file = new File("newFilename")
file.append("I wish to append this to line 8")
In general, with Java and Groovy file handling, you can only append to the end of files. There is no way to insert information, although you can overwrite data anywhere without shifting the position of what follows.
This means that to append to a specific line that is not at the end of the file you need to rewrite the whole file.
For example:
def file = new File("newFilename")
new File("output").withPrintWriter { out ->
    def linenumber = 1
    file.eachLine { line ->
        if (linenumber == 8) {
            out.print(line)
            out.println("I wish to append this to line 8")
        } else {
            out.println(line)
        }
        linenumber += 1
    }
}
For small files you can use the following piece of code:
def f = new File('file')
def lines = f.readLines()
lines = lines.plus(7, "I'm a new line!")
f.text = lines.join('\n')