We have a txt file that is being populated by a database dump, but it contains CR and LF breaks that we don't want. Basically, I am trying to edit C:\app.txt, remove all CRs and LFs, then add !## in front of "_TEXT_", and finally add a CR in front of "!##_TEXT_". That way I only have CRs in the places I want them, not scattered everywhere.
I have tried using change.exe by Bruce Gunthrie, which worked well in a 32-bit environment but doesn't work on a 64-bit PC.
Any help would be greatly appreciated.
I saw some similar posts here, but had trouble reading the code because it was too complex, so I didn't know how to adapt it to our environment.
Thanks
Luke
eg.
_TEXT_ data data1
data2 data3 data4
_TEXT_ data data1
data2 data3 data4
Should read:
_TEXT_ data data1 data2 data3 data4
_TEXT_ data data1 data2 data3 data4
Is PowerShell an option? (I would guess so, a 64-bit computer with a current Windows version has PowerShell installed.)
$(
    $line = ''
    switch -Wildcard -File C:\app.txt {
        '_TEXT_*' {
            if ($line) { $line }
            $line = "!##$_"
        }
        default {
            $line += ' ' + $_
        }
    }
    if ($line) { $line }
) -join "`r"
This will join the lines with CR, as you wished. Pipe the result to Set-Content to write it to a file.
I have a text file structured like this:
[timestamp1] header with space
[timestamp2] data1
[timestamp3] data2
[timestamp4] data3
[timestamp5] ..
[timestamp6] footer with space
[timestamp7] junk
[timestamp8] header with space
[timestamp9] data4
[timestamp10] data5
[timestamp11] ...
[timestamp12] footer with space
[timestamp13] junk
[timestamp14] header with space
[timestamp15] data6
[timestamp16] data7
[timestamp17] data8
[timestamp18] ..
[timestamp19] footer with space
I need to find each part between header and footer and save it in another file. For example the file1 should contain (with or without timestamps; doesn't matter):
data1
data2
data3
..
and the next pack should be saved as file2 and so on.
This seems like a routine process, but I haven't found a solution yet.
I have this sed command that finds the first packet.
sed -n "/header/,/footer/{p;/footer/q}" file
But I don't know how to iterate that over the next matches. Maybe I should delete the first match after copying it to another file and repeat the same command.
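For what it's worth, that "copy the first match, delete it, repeat" idea can be sketched as a loop. This is only a sketch: it assumes GNU sed (for -i), and the sample input is created inline to stand in for the real file.

```shell
# Sketch of the "extract first block, delete it, repeat" idea.
# Assumes GNU sed (for -i); the sample input here stands in for the real file.
printf '%s\n' '[t1] header with space' '[t2] data1' '[t3] footer with space' \
              '[t4] junk' '[t5] header with space' '[t6] data2' \
              '[t7] footer with space' > file
cp file work
n=1
while grep -q header work; do
    sed -n '/header/,/footer/{p;/footer/q;}' work > "file$n"   # copy first header..footer block
    sed -i '1,/footer/d' work                                  # then drop it from the working copy
    n=$((n+1))
done
rm work
```

Each pass writes one header..footer block to file1, file2, ... and shortens the working copy until no header remains.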
I would harness GNU AWK for this task following way, let file.txt content be
[timestamp1] header with space
[timestamp2] data1
[timestamp3] data2
[timestamp4] data3
[timestamp5] ..
[timestamp6] footer with space
[timestamp7] junk
[timestamp8] header with space
[timestamp9] data4
[timestamp10] data5
[timestamp11] ...
[timestamp12] footer with space
[timestamp13] junk
[timestamp14] header with space
[timestamp15] data6
[timestamp16] data7
[timestamp17] data8
[timestamp18] ..
[timestamp19] footer with space
then
awk '/header/{c+=1;p=1;next}/footer/{close("file" c);p=0}p{print $0 > ("file" c)}' file.txt
produces file1 with content
[timestamp2] data1
[timestamp3] data2
[timestamp4] data3
[timestamp5] ..
and file2 with content
[timestamp9] data4
[timestamp10] data5
[timestamp11] ...
and file3 with content
[timestamp15] data6
[timestamp16] data7
[timestamp17] data8
[timestamp18] ..
Explanation: my code has 3 pattern-action pairs. For a line containing header I increase counter c by 1, set flag p to 1, and go to the next line so no other action is taken; for a line containing footer I close the file named file followed by the current counter number and set flag p to 0. For lines where p is true I print the current line ($0) to the file named file followed by the current counter number. If required, adjust /header/ and /footer/ so they match solely the lines which are header and footer lines.
(tested in GNU Awk 5.0.1)
Using any awk:
$ awk '/footer/{f=0} f{print > out} /header/{close(out); out="file" (++c); f=1}' file
$ head file?*
==> file1 <==
[timestamp2] data1
[timestamp3] data2
[timestamp4] data3
[timestamp5] ..
==> file2 <==
[timestamp9] data4
[timestamp10] data5
[timestamp11] ...
==> file3 <==
[timestamp15] data6
[timestamp16] data7
[timestamp17] data8
[timestamp18] ..
A very naive approach, coded quickly and open to improvement, but it seems to work, in awk:
BEGIN {
    i = 0
}
{
    if ($0 ~ /header/) {
        write = 1
    } else if ($0 ~ /footer/) {
        write = 0
        i = i + 1
    } else if (write == 1) {
        print $0 > ("file" i)
    }
}
Based on THIS REGEX, here is a Ruby solution:
ruby -e 'cnt = 1
$<.read.scan(/^.*\bheader\b.*\s+([\s\S]*?)(?=^.*\bfooter\b)/) { |b|
  File.write("File_#{cnt}.txt", b[0])
  cnt += 1
}' file
Produces:
$ head File_*
==> File_1.txt <==
[timestamp2] data1
[timestamp3] data2
[timestamp4] data3
[timestamp5] ..
==> File_2.txt <==
[timestamp9] data4
[timestamp10] data5
[timestamp11] ...
==> File_3.txt <==
[timestamp15] data6
[timestamp16] data7
[timestamp17] data8
[timestamp18] ..
If you want to remove the timestamps:
ruby -e 'cnt = 1
$<.read.scan(/^.*\bheader\b.*\s+([\s\S]*?)(?=^.*\bfooter\b)/) { |b|
  File.write("File_#{cnt}.txt", b[0].gsub(/^\[[^\]]+\]\s+/, ""))
  cnt += 1
}' file
$ head File_*
==> File_1.txt <==
data1
data2
data3
..
==> File_2.txt <==
data4
data5
...
==> File_3.txt <==
data6
data7
data8
..
Note: If you want to include the header and/or footer, just move the capture group to include what you want.
This might work for you (GNU csplit and sed):
csplit -qf file -b '%d' --supp file '/header/' '{*}' && sed -i '/footer/,$d' file? && rm file0
Use csplit to split file into multiple file<n> files on header, suppressing the matching line.
Use sed to delete footer and any following lines.
Remove the unwanted file0 file.
Alternative (GNU sed, using the e flag to execute shell commands):
sed -En '/header/{x;s/.*/echo $((0&+1))/e;x};/header/,/footer/!b;//b;G;s/(.*)\n/echo "\1" >>file/e' file
The hold space keeps a counter that is incremented at each header; each line between header and footer is appended to file<counter> by an executed echo.
I am trying to go through a folder containing multiple CSV files and combine them into a single larger file. To do that I want to copy the info in each CSV along with its name (i.e. foo from CSVFiles/foo.csv) into a new CSV. How do I get the name of the CSV I'm reading from as I read the whole set?
I am using
IFS=","
files=/CSVFiles/*
cat $files|while read col1 col2
do
echo $col1 $col2
done
I just need to get the name of the specific file that col1 and col2 are from. I tried putting the while loop inside a for f in $files loop but got an unspecified error. Any ideas?
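The for-loop version the question attempts can work if the read loop runs once per file, so the file name is in hand while its rows are read. A sketch, with the CSVFiles/ path and two-column layout assumed from the question and sample data created inline for illustration:

```shell
# Per-file read loop: the file's name is available while its rows are read.
# CSVFiles/ and the two-column layout are assumptions from the question;
# sample data is created here for illustration.
mkdir -p CSVFiles
printf 'a,b\nc,d\n' > CSVFiles/foo.csv
for f in CSVFiles/*.csv; do
    name=$(basename "$f" .csv)           # e.g. "foo" from CSVFiles/foo.csv
    while IFS=, read -r col1 col2; do
        echo "$name,$col1,$col2"         # prepend the source file's name
    done < "$f"
done > combined.csv
```

Redirecting the loop's input with `< "$f"` (rather than cat-ing all files into one stream) is what keeps each row associated with its source file.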
You can use awk:
cd CSVFiles
awk -F, -v OFS=, '
FNR == 1 {
    f = FILENAME
    sub(/\.csv$/, "", f)
    print f
}
{ print $1, $2 }' *.csv
You can use f wherever you want, e.g. {print f": "$1,$2}.
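For instance, to emit the base name as an extra first column on every row rather than as a separate line, this variant works (a sketch; the two sample .csv files are created inline for illustration):

```shell
# Variant of the answer above: the file's base name becomes the first column
# on every row. Sample .csv files are created here for illustration.
printf '1,2\n' > foo.csv
printf '3,4\n' > bar.csv
awk -F, -v OFS=, '
FNR == 1 { f = FILENAME; sub(/\.csv$/, "", f) }   # per file: strip the extension once
{ print f, $1, $2 }                               # name becomes the first column
' foo.csv bar.csv
```

FNR resets to 1 at the start of each input file, so f is recomputed exactly once per file.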
I'm using a PowerShell script which converts a specific log format to a tab- or comma-separated (CSV) format, and it looks like this:
$filename = "filename.log"
foreach ($line in [System.IO.File]::ReadLines($filename)) {
    $x = [regex]::Split($line, 'regex')
    $xx = $x -join ","
    $xx >> Results.csv
}
And it works fine, but for a 20MB log file it takes almost 20 min to be converted! Is there a way to accelerate it?
My System: CPU: Corei7 3720QM / RAM: 8GB
Update: The log format is like this:
192.168.1.5:24652 172.16.30.8:80 http://www.example.com "useragent"
I want destination format to be:
192.168.1.5,24652,172.16.30.8,80,http://www.example.com,"useragent"
REGEX: ^([\d\.]+):(\d+)\s+([\d\.]+):(\d+)\s+([^ ]*)\s+(\".*\")$
As Lieven Keersmaekers points out, you can do a single -replace operation to do the work.
Additionally, foreach ($thing in $o.GetThings()) {} will initially block until GetThings() returns and then store the entire result in memory, which you don't need. You can avoid this by using the pipeline instead.
Finally, your regex can be simplified so that the engine doesn't have to parse the entire string before splitting, by matching on either : preceded by a digit or whitespace:
Get-Content filename.log | ForEach-Object {
    $_ -replace '(?:(?<=\d)\:|\s+)', ','
} | Out-File results.csv
I was wondering if someone could help me better understand what this given code to parse a text file is doing.
while ($line = <STDIN>) {
    @flds = split("\t", $line);
    foreach $fld (@flds) {
        if ($fld =~ s/^"(.*)"$/$1/) {
            $fld =~ s/""/"/g;
        }
    }
    print join("\t", @flds), "\n";
}
We are given this block of code as a start to parse a text file such as:
Name Problem #1 Comments for P1 E.C. Problem Comments Email
Park, John 17 Really bad. 5 park@gmail.edu
Doe, Jane 100 Well done! 0 Why didn't you do this? doe2@gmail.edu
Smith, Bob 0 0 smith9999@gmail.com
...which will be used to set up a formatted output based on the parsed text.
I'm having trouble fully understanding how the block of code is parsing and holding the information so that I can know how to access certain parts of the information I want. Could someone better explain what the above code is doing at each step?
This actually looks like a rather crude way to parse a CSV-style file.
while ($line = <STDIN>) {              # Read from STDIN one line at a time.
    @flds = split("\t", $line);        # Split the line on tabs into the array @flds.
    foreach $fld (@flds) {             # Loop over each item/column in @flds, aliased as $fld.
        if ($fld =~ s/^"(.*)"$/$1/) {  # Is the column surrounded by quotes? If so, strip them.
            $fld =~ s/""/"/g;          # Replace each doubled quote with a single quote.
        }
    }
    print join("\t", @flds), "\n";     # Join the fields back with tabs and print, appending a newline.
}
I have a csv file like this:
ELAPSEDTIME_SEC;CPU_%;RSS_KBYTES
0;3.4;420012
1;3.4;420012
2;3.4;420012
3;3.4;420012
4;3.4;420012
5;3.4;420012
And I'd like to convert the values (they are seconds) in the first column to hh:mm:ss format (or whatever Excel or LibreOffice can import as time format from csv) and insert it back to the file into a new column following the first. So the output would be something like this:
ELAPSEDTIME_SEC;ELAPSEDTIME_HH:MM:SS;CPU_%;RSS_KBYTES
0;0:00:00;3.4;420012
1;0:00:01;3.4;420012
2;0:00:02;3.4;420012
3;0:00:03;3.4;420012
4;0:00:04;3.4;420012
5;0:00:05;3.4;420012
And I'd have to do this in Bash to work under Linux and OS X as well.
I hope this is what you want:
TZ=UTC awk -F';' -vOFS=';' '
{
$1 = $1 ";" (NR==1 ? "ELAPSEDTIME_HH:MM:SS" : strftime("%H:%M:%S", $1))
}1' input.csv
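One caveat: strftime is a GNU awk extension, and the question asks for OS X too, where the stock BSD awk lacks it. A portable sketch that computes the field arithmetically instead (assuming elapsed times stay below 24 hours; the input file is recreated inline for illustration):

```shell
# Portable variant: no strftime, so it also works with BSD awk on macOS.
# Assumes elapsed seconds stay below 24 hours; sample input created here.
printf 'ELAPSEDTIME_SEC;CPU_%%;RSS_KBYTES\n0;3.4;420012\n61;3.4;420012\n' > input.csv
awk -F';' -v OFS=';' '
NR == 1 { $1 = $1 OFS "ELAPSEDTIME_HH:MM:SS"; print; next }
{
    s = $1
    $1 = $1 OFS sprintf("%d:%02d:%02d", s/3600, (s%3600)/60, s%60)
    print
}' input.csv
```

Assigning to $1 makes awk rebuild the record with OFS, so the new column slots in after the first without touching the rest of the line.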
Thinking about your question, I found an interesting manipulation possibility: insert a formula into the CSV, and pass it to ooCalc:
cat se.csv | while read line ; do n=$((n+1)) ; ((n>1)) && echo ${line/;/';"=time(0;0;$A$'$n')";'} ||echo ${line/;/;time of A;} ;done > se2.csv
formatted:
cat se.csv | while read line ; do
n=$((n+1))
((n>1)) && echo ${line/;/';"=time(0;0;$A$'$n')";'} || echo ${line/;/;time of A;}
done > se2.csv
Remarks:
This adds a column - it doesn't replace one
You have to set the import options for CSV correctly. In this case:
delimiter = semicolon (well, we had to do this for the original file as well)
text delimiter = " (wasn't the default)
deactivate checkbox "quoted field as text"
depending on your locale, the function name has to be translated. For example, in German I had to use "zeit(" instead of "time("
since formulas use semicolons themselves, the approach would be simpler and need less escaping if the delimiter were something else, maybe a tab.
In practice, you might treat the headline like all the other lines and correct it manually at the end, but the audience of SO expects everything to work out of the box, so the command became somewhat longer.
I would have preferred to replace the whole cat/while read loop with just a short sed '...' command, and I found a remark in the man page of sed that = can be used for the row number, but I don't know how to handle it.
Result:
cat se2.csv
ELAPSEDTIME_SEC;time of A;CPU_%;RSS_KBYTES
0;"=time(0;0;$A$2)";3.4;420012
1;"=time(0;0;$A$3)";3.4;420012
2;"=time(0;0;$A$4)";3.4;420012
3;"=time(0;0;$A$5)";3.4;420012
4;"=time(0;0;$A$6)";3.4;420012
5;"=time(0;0;$A$7)";3.4;420012
In this specific case, the awk-solution seems better, but I guess this approach might sometimes be useful to know.