Print the results of a grep to a csv file - linux

I am trying to print the results of grep to a CSV file. When I run the command grep FILENAME * in my directory, I get this result:
B140509184-#-02Jun2015-#-11:00-#-12:00-#-LT4-UGW-MAJAZ-#-I-#-CMAP-#-P-45088966.lif: FILENAME A20150602.0835+0400-0840+0400_DXB-GGSN-V9-PGP-16.xml
What I want is to print the FILENAME part to a CSV. Below is what I have tried so far; I am a bit lost on how I should go about printing the result:
BASE_DIR=/appl/virtuo/var/loader/spool/umtsericssonggsn_2013a/2013A/good
cd ${BASE_DIR}
grep FILENAME *
#grep END_TIME *
#grep START_TIME *
#my $Filename = grep FILENAME *;
print $Filename;
#$ awk '{print;}' employee.txt
#echo "$Filename :"
File sample
measInfoId 140509184
GP 3600
START_DATE 2015-06-02
The output should be 140509184 3600 2015-06-02, as columns of a CSV file.

Updated Answer
If you only want lines that start with GP, or START_DATE or measInfoId, you would modify the awk commands below to look like this:
awk '/^GP/ || /^START_DATE/ || /^measInfoId/ {print $2}' file
Original Answer
I am not sure what you are trying to do actually, but this may help...
If your input file is called file, and contains this:
measInfoId 140509184
GP 3600
START_DATE 2015-06-02
This command will print the second field on each line:
awk '{print $2}' file
140509184
3600
2015-06-02
Then, building on that, this command will put all those onto a single line:
awk '{print $2}' file | xargs
140509184 3600 2015-06-02
And then this one will translate the spaces into commas:
awk '{print $2}' file | xargs | tr " " ","
140509184,3600,2015-06-02
And if you want to do that for all the files in a directory, you can do this:
for f in *; do
awk '{print $2}' "$f" | xargs | tr " " ","
done > result.csv
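If you also want each row to record which file it came from, a small variation (just a sketch; the filename-as-first-column layout is my assumption) prefixes the filename before the values:
for f in *; do
printf '%s,' "$f"                            # filename as the first CSV column
awk '{print $2}' "$f" | xargs | tr " " ","   # then the second fields, comma-joined
done > result.csv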

Related

Fastest way to compare hundreds of thousands of files, and create output results file in bash

I have the following:
- A values file, values.txt
- Directory structure: ./dataset/label/author/files.txt
- Tens of thousands of files.txt's
- A file called targets.txt, which contains the location of every files.txt
Example targets.txt
./dataset/tallperson/Jabba/awesome.txt
./dataset/fatperson/Detox/toxic.txt
I have a file called values.txt, which contains hundreds of thousands of lines of values. These values are things like "aef", "; i", "jfk", etc.: random 3-character lines.
I also have tens of thousands of files, each of which contains hundreds to thousands of lines. Each line is likewise a random 3-character string.
values.txt was created from the values of each files.txt, so there is no value in any files.txt that isn't contained in values.txt. values.txt contains NO repeating values.
Example:
./dataset/weirdperson/Crooked/file1.txt
LOL
hel
lo
how
are
you
on
thi
s f
ine
day
./dataset/awesomeperson/Mild/file2.txt
I a
m v
ery
goo
d.
Tha
nks
LOL
values.txt
are
you
on
thi
s f
ine
day
goo
d.
Tha
hel
lo
how
I a
m v
ery
nks
LOL
The above is just example data. Each file will contain hundreds of lines. And values.txt will contain hundreds of thousands of lines.
My goal here is to make one file, where each line represents one of the files. Each line will contain N comma-separated values, where the Nth value corresponds to the Nth line of values.txt. Each value is calculated simply as how many times the file contains the value on that line of values.txt.
The result should look something like this, with line 1 being file1.txt and line 2 being file2.txt:
Result.txt
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,
Now, the last thing: after getting this result I would like to add a label. The label is the Nth parent directory of the file. For this example, let's say the 2nd parent directory, so the label would be "weirdperson" or "awesomeperson". As a result, the new Results.txt file would look like this:
Results.txt
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
I would like a way to accomplish all of this, but I need it to be fast as I am working with a very large scale dataset.
This is my current code, but it's too slow. The bottleneck is line 2.
Script. Each file located at "./dataset/label/author/file.java"
1 while IFS= read file_name; do
2 cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" "$file_name" | xargs printf "%d," >> Results.txt;
3 label=$(echo "$file_name" | cut -d '/' -f 3);
4 printf "$label\n" >> Results.txt;
5 done < targets.txt
------------
To REPLICATE this problem, do the following:
mkdir -p dataset/{label1,label2}
touch file1.txt; chmod 777 file1.txt
touch file2.txt; chmod 777 file2.txt
echo "Enter anything here" > file1.txt
echo "Enter something here too" > file2.txt
mv file1.txt ./dataset/label1
mv file2.txt ./dataset/label2
find ./dataset/ -type f -name "*.txt" | while IFS= read file_name; do cat $file_name | sed -e "s/.\{3\}/&\n/g" | sort -u > $modified-file_name; done
find ./dataset/ -type f -name "modified-*.txt" | xargs -d '\n' -I {} echo {} >> targets.txt
xargs cat < targets.txt | sort -u > values.txt
With the above UNCHANGED, you should get a values.txt similar to the one below. If any line has fewer or more than 3 characters for some reason, please delete it.
any
e
Ent
er
eth
he
her
ing
ng
re
som
thi
too
You should get a targets.txt file like:
./dataset/label2/modified-file2.txt
./dataset/label1/modified-file1.txt
From here, the goal is to check every file in targets.txt, count how many times it contains each value in values.txt, and output the results with the label to Results.txt.
The following script will work for this example, but I need it to be way faster for large scale operations.
while IFS= read file_name; do
cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" $file_name | xargs printf "%d," >> Results.txt;
label=$(echo "$file_name" | cut -d '/' -f 3);
printf "$label\n" >> Results.txt;
done < targets.txt
Here's another example
Example 2:
./dataset/weirdperson/Crooked/file1.txt
LOL
LOL
HAHA
./dataset/awesomeperson/Mild/file2.txt
LOL
LOL
LOL
values.txt
LOL
HAHA
Result.txt
2,1,weirdperson
3,0,awesomeperson
Here's a solution in Python, using its ordered dictionary datatype.
import os
from collections import OrderedDict

# read samples from values.txt into an OrderedDict:
# each dict key is a line from the file
# (including the trailing newline, but that doesn't matter);
# each dict value is 0
with open('values.txt', 'r') as f:
    samplecount0 = OrderedDict((sample, 0) for sample in f.readlines())

# get the list of filenames from targets.txt
with open('targets.txt', 'r') as f:
    targets = [t.rstrip('\n') for t in f.readlines()]

# for each target:
#   read its lines of samples,
#   increment the corresponding count in samplecount,
#   print out samplecount on a single line separated by commas;
# each line also gets the 2nd-to-last directory component of the target's pathname
for target in targets:
    with open(target, 'r') as f:
        # copy samplecount0 to samplecount so we don't have to read values.txt again
        samplecount = samplecount0.copy()
        # for each sample in the target file, increment the samplecount dict entry
        for tsample in f.readlines():
            samplecount[tsample] += 1
    output = ','.join(str(v) for v in samplecount.values())
    output += ',' + os.path.basename(os.path.dirname(os.path.dirname(target)))
    print(output)
Output:
$ python3 doit.py
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
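To produce the Results.txt file asked for in the question, redirect the script's output (doit.py is the name used in the run above):
python3 doit.py > Results.txt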
Try this:
<targets.txt xargs -n1 -P4 bash -c "
awk 'NR==FNR{a[\$0];next} {if (\$0 in a) {printf \"1,\"} else {printf \"0,\"}}' \"\$1\" values.txt |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
The -P4 lets you run the jobs from targets.txt in parallel, four at a time. The short awk script reads the target file into an array, then for each line of values.txt prints 1 or 0 (present or not) followed by a comma. Then sed is used to append the 3rd component of the file's path to the end of the line. The sed command looks strange because I used the unprintable character $'\x01' as the separator for the s command.
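Just as a sketch of an alternative, the label can also be appended by awk itself, dropping the sed step (the split() on "/" assumes the same ./dataset/label/... layout from the question):
<targets.txt xargs -n1 -P4 bash -c "
  awk -v target=\"\$1\" '
    BEGIN {split(target, parts, \"/\")}                  # parts[3] is the label directory
    NR==FNR {a[\$0]; next}                               # first file: remember the target lines
    {if (\$0 in a) printf \"1,\"; else printf \"0,\"}    # then test each line of values.txt
    END {print parts[3]}' \"\$1\" values.txt
" --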
Tested with:
mkdir -p ./dataset/weirdperson/Crooked
cat <<EOF >./dataset/weirdperson/Crooked/file1.txt
LOL
hel
lo
how
are
you
on
thi
s f
ine
day
EOF
mkdir -p ./dataset/awesomeperson/Mild/
cat <<EOF >./dataset/awesomeperson/Mild/file2.txt
I a
m v
ery
goo
d.
Tha
nks
LOL
EOF
cat <<EOF >values.txt
are
you
on
thi
s f
ine
day
goo
d.
Tha
hel
lo
how
I a
m v
ery
nks
LOL
EOF
cat <<EOF >targets.txt
./dataset/weirdperson/Crooked/file1.txt
./dataset/awesomeperson/Mild/file2.txt
EOF
measure_start() {
declare -g ttic_start
echo "==> Test $* <=="
ttic_start=$(date +%s.%N)
}
measure_end() {
local end
end=$(date +%s.%N)
local start
start="$ttic_start"
ttic_runtime=$(python -c "print(${end} - ${start})")
echo "Runtime: $ttic_runtime"
echo
}
measure_start original
while IFS= read file_name; do
cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" $file_name | xargs printf "%d,"
label=$(echo "$file_name" | cut -d '/' -f 3);
printf "$label\n"
done < targets.txt
measure_end
measure_start first try with bash
nl -w1 values.txt | sort -k2.2 > values_sorted.txt
< targets.txt xargs -n1 -P0 bash -c "
sort -t$'\t' \"\$1\" |
join -t$'\t' -12 -21 -eEMPTY -a1 -o1.1,2.1 values_sorted.txt - |
sort -s -n -k1.1 |
sed 's/.*\tEMPTY/0/;t;s/.*/1/' |
tr '\n' ',' |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
measure_end
measure_start second try with awk
<targets.txt xargs -n1 -P0 bash -c "
awk 'NR==FNR{a[\$0];next} {if (\$0 in a) {printf \"1,\"} else {printf \"0,\"}}' \"\$1\" values.txt |
sed $'s\x01$\x01'\"\$(<<<\"\$1\" cut -d/ -f3)\"'\n'$'\x01'
" --
measure_end
Outputs:
==> Test original <==
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
Runtime: 0.133769512177
==> Test first try with bash <==
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
Runtime: 0.0322473049164
==> Test second try with awk <==
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
Runtime: 0.0180222988129

How to print file names from a file where awk is selecting values?

I have a .txt file containing file names such as:
z1.cap
z2.cap
z3.cap
z4.cap
Sample data in these files looks like this:
OTR 25896 PAT210 $TREMD DEST
OFR 21475 NAT102 #TREMD DEST
Then I'm using the code below to print the desired values from the files:
while read file_name
do
echo "progressing with file :${file_name}"
cat ${file_name} | grep "PAT210" | awk -F' ' '$5 == "(DEST" { print $file_name, $1}' | uniq >> OUTPUT_FILE
done
Now I want output consisting of 2 fields, like:
z1.cap OTR
z2.cap OFR
and so on...
But I'm getting output like:
- OTR
- OFR
Any help is appreciated, thanks.
To access the filename that awk is currently processing, use the built-in variable FILENAME.
To bind shell variables from your shell to variables in awk, use:
awk -v var1="$shvar1" -v var2="$shvar2" 'your awk code using var1 and var2'
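For instance, a minimal sketch against the question's sample data (z1.cap and the field tests come from the question; the variable name fname is mine):
file_name=z1.cap
# pass the shell variable into awk as fname (FILENAME would also work here)
awk -v fname="$file_name" '$3 == "PAT210" && $5 == "DEST" { print fname, $1 }' "$file_name"
# prints: z1.cap OTR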
Assuming files.txt contains your list of files, and with zero understanding of what exactly you are trying to achieve:
for file_name in $(cat files.txt)
do
echo "progressing with file :${file_name}"
awk -F' ' '($5 == "DEST") && ($3=="PAT210") { print FILENAME, $1}' "$file_name" | uniq >> OUTPUT_FILE
done
I removed the cat and incorporated the grep into your awk. The cat was unnecessary since awk can read the file itself.
You can remove the for loop entirely by saying
awk -F' ' '($5 == "DEST") && ($3=="PAT210") { print FILENAME, $1}' $(<files.txt) | uniq >> OUTPUT_FILE
The $(<files.txt) expands to the contents of files.txt, so each filename in it becomes an argument to awk.

How to print the tail of a path filename using awk

I've searched with no success.
I have a file with paths.
I want to print the tail (the last component) of every path.
for example (for every line in file):
/homes/work/abc.txt
--> abc.txt
Does anyone know how to do it?
Thanks
awk -F "/" '{print $NF}' input.txt
will give output of:
abc1.txt
abc2.txt
abc3.txt
for:
$ cat input.txt
text path/to/file/abc1.txt
path/to/file/abc2.txt
path/to/file/abc3.txt
How about this awk
echo "/homes/work/abc.txt" | awk '{sub(/.*\//,x)}1'
abc.txt
Since .* is greedy, it matches up to the last /.
So here we replace everything up to the last / with x, and since x is an uninitialized (hence empty) variable, the prefix is simply deleted.
Thor's version:
echo "/homes/work/abc.txt" | awk -F/ '$0=$NF'
abc.txt
NB: this will fail for /homes/work/0 or 0,0 etc., because the assignment $0=$NF also serves as the pattern, and a value of 0 is false, so the line is not printed. Better to use:
echo "/homes/work/abc.txt" | awk -F/ '{$0=$NF}1'
awk solutions are already provided by @Jotne and @bashophil
Here are some other variations (just for fun)
Using sed
sed 's:.*/::' file
Using grep
grep -oP '(.*/)?\K.*' file
Using cut - added by @Thor
rev file | cut -d/ -f1 | rev
Using basename - suggested by @fedorqui and @EdMorton
while IFS= read -r line; do
basename "$line"
done < file

Get the Nth line from unzip -l

I have a jar file; I need to execute the files in it on Linux.
So I need to get the result of the unzip -l command line by line.
I have managed to extract the file names with this command:
unzip -l package.jar | awk '{print $NF}' | grep com/tests/[A-Za-z] | cut -d "/" -f3 ;
But I can't figure out how to obtain the file names one after another so I can execute them.
How can I do it, please?
Thanks a lot.
If all you need is the first row in a column, add a pipe and get the first line using head -1.
So your one-liner will look like:
unzip -l package.jar | awk '{print $NF}' | grep com/tests/[A-Za-z] | cut -d "/" -f3 | head -1;
That will give you the first line.
Now, combine head and tail to get the second line:
unzip -l package.jar | awk '{print $NF}' | grep com/tests/[A-Za-z] | cut -d "/" -f3 | head -2 | tail -1;
But from a scripting point of view this is not a good approach. What you need is a loop, as below:
for class in `unzip -l el-api.jar | awk '{print $NF}' | grep javax/el/[A-Za-z] | cut -d "/" -f3`; do echo $class; done;
You can replace echo $class with whatever command you wish, and use $class to get the current class name.
HTH
Here is my attempt, which also takes into account Daddou's request to remove the .class extension:
unzip -l package.jar | \
awk -F'/' '/com\/tests\/[A-Za-z]/ {sub(/\.class/, "", $NF); print $NF}' | \
while read baseName
do
echo " $baseName"
done
Notes:
The awk command also handles the tasks of grep and cut
The awk command also handles the removal of the .class extension
The result of the awk command is piped into the while read... command
baseName represents the name of the class file, with the .class extension removed
Now, you can do something with that $baseName
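For example, a hypothetical way to run each class (the com.tests package prefix is only inferred from the grep pattern, and java -cp assumes these are classes on the jar's classpath; adjust to your real package):
unzip -l package.jar | \
awk -F'/' '/com\/tests\/[A-Za-z]/ {sub(/\.class/, "", $NF); print $NF}' | \
while read -r baseName
do
java -cp package.jar "com.tests.$baseName"   # assumed package name
done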

Extract the list of file names in a zip archive with unzip -l

When I do unzip -l zipfilename, I see
1295627 08-22-11 07:10 A.pdf
473980 08-22-11 07:10 B.pdf
...
I only want to see the filenames. I tried this:
unzip -l zipFilename | cut -f4 -d" "
but I don't think the delimiter is just " ".
The easiest way to do this is to use the following command:
unzip -Z -1 archive.zip
or
zipinfo -1 archive.zip
This will list only the file names, one on each line.
The two commands are exactly equivalent. The -Z option tells unzip to treat the rest of the options as zipinfo options. See the man pages for unzip and zipinfo.
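For example, to iterate over the entries one per line (a minimal sketch; archive.zip is a placeholder name):
# spaces in names are preserved because each entry arrives on its own line
zipinfo -1 archive.zip | while IFS= read -r name; do
printf 'entry: %s\n' "$name"
done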
Assuming none of the files have spaces in names:
unzip -l filename.zip | awk '{print $NF}'
My unzip output has both a header and footer, so the awk script becomes:
unzip -l filename.zip | awk '/-----/ {p = ++p % 2; next} p {print $NF}'
A version that handles filenames with spaces:
unzip -l filename.zip | awk '
/----/ {p = ++p % 2; next}
$NF == "Name" {pos = index($0,"Name")}
p {print substr($0,pos)}
'
If you need to cater for filenames with spaces, try:
unzip -l zipfilename.zip | awk -v f=4 ' /-----/ {p = ++p % 2; next} p { for (i=f; i<=NF;i++) printf("%s%s", $i,(i==NF) ? "\n" : OFS) }'
Use awk:
unzip -l zipfilename | awk '{print $4}'
