How to remove a specific string common in multiple lines in a CSV file using shell script? - linux

I have a csv file which contains 65000 lines (Size approximately 28 MB). In each of the lines a certain path in the beginning is given e.g. "c:\abc\bcd\def\123\456". Now let's say the path "c:\abc\bcd\" is common in all the lines and rest of the content is different. I have to remove the common part (In this case "c:\abc\bcd\") from all the lines using a shell script. For example the content of the CSV file is as mentioned.
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.frag 0 0 0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.vert 0 0 0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.frag 16 24 3
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.vert 87 116 69
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.vert.bin 75 95 61
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0 0 0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-6 0 0 0
In the above example I need the output as below
FILE0.frag 0 0 0
FILE0.vert 0 0 0
FILE0.link-link-0.frag 17 25 2
FILE0.link-link-0.vert 85 111 68
FILE0.link-link-0.vert.bin 77 97 60
FILE0.link-link-0 0 0
FILE0.link 0 0 0
Can any of you please help me out with this?

You could use sed:
$ cat test.csv
"c:\abc\bcd\def\123\456", 1, 2
"c:\abc\bcd\def\234\456", 1, 2
"c:\abc\bcd\def\432\456", 3, 4
$ sed -i.bak -e 's/c\:\\abc\\bcd\\//1' test.csv
$ cat test.csv
"def\123\456", 1, 2
"def\234\456", 1, 2
"def\432\456", 3, 4
I am using sed here in this way:
sed -e 's/<SEARCH TERM>/<REPLACE_TERM>/<OCCURANCE>' FILE
where
<SEARCH TERM> is what we are looking for (in this case c:\abc\bcd\, but backslashes need to be escaped).
<REPLACE TERM> is what we want to replace it with, in this case nothing, and
<OCCURANCE> is which occurance of the item we want to replace, in this case the first item in each line.
(-i.bak stands for: Don't output, just edit this file. (but make a backup first))
Updated according to #david-c-rankin comment. He is right, make a backup before editing files in case you make a mistake.

# init variable
MaxPath="$( sed -n 's/,.*//p;1q' YourFile )"
GrepPath="^$( printf "%s" "${MaxPath}" | sed 's#\\#\\\\#g' )"
# search the biggest pattern to remove
while [ ${#MaxPath} -gt 0 ] && [ $( grep -c -v -E "${GrepPath}" YourFile ) -gt 0 ]
do
MaxPath="${MaxPath%%?}"
GrepPath="^$( printf "%s" "${MaxPath}" | sed 's#\\#\\\\#g' )"
done
# Adapt your file
if [ ${#MaxPath} -gt 0 ]
then
sed "s#${GrepPath}##" YourFile
fi
Assuming for the sample that there is no special regex char nor # in MaxPath
the grep -c -v -E is not optimized in term of performance (treat whle file each time where it can stop at first miss)

Related

GNU Awk - don't modify whitespaces

I am using GNU Awk to replace a single character in a file. The file is a single line with varying whitespacing between "fields". After passing through gawk all the extra whitespacing is removed and I end up with single spaces. This is completely unintended and I need it to ignore these spaces and only change the one character I have targeted. I have tried several variations, but I cannot seem to get gawk to ignore these extra spaces.
Since I know this will come up, I read from the end of the line for replacement because the whitespacing is arbitrary/inconsistent in the source file.
Command:
gawk -i inplace -v new=3 'NF {$(NF-5) = new} 1' ~/scripts/tmp_beta_weather_file
Original file example:
2020-07-01 18:29:51.00 C M -11.4 28.9 29 9 23 5.5 000 0 0 00020 044013.77074 1 1 1 3 0 0
Result after command above:
2020-07-01 18:30:51.00 C M -11.8 28.8 29 5 23 5.5 000 0 0 00020 044013.77143 3 1 1 3 0 0
it might be easier with sed
sed -E 's/([^ ]+)(( [^ ]+){5})$/3\2/' file
test and add -i for in-place edit.

How to delete the first column ( which is in fact row names) from a data file in linux?

I have data file with many thousands columns and rows. I want to delete the first column which is in fact the row counter. I used this command in linux:
cut -d " " -f 2- input.txt > output.txt
but nothing changed in my output. Does anybody knows why it does not work and what should I do?
This is what my input file looks like:
col1 col2 col3 col4 ...
1 0 0 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 0 0
5 0 1 1 1
6 1 1 1 0
7 1 0 0 0
8 0 0 0 0
9 1 0 0 0
10 1 1 1 1
11 0 0 0 1
.
.
.
I want my output look like this:
col1 col2 col3 col4 ...
0 0 0 1
0 1 0 1
0 1 0 0
0 0 0 0
0 1 1 1
1 1 1 0
1 0 0 0
0 0 0 0
1 0 0 0
1 1 1 1
0 0 0 1
.
.
.
I also tried the sed command:
sed '1d' input.file > output.file
But it deletes the first row not the first column.
Could anybody guide me?
idiomatic use of cut will be
cut -f2- input > output
if you delimiter is tab ("\t").
Or, simply with awk magic (will work for both space and tab delimiter)
awk '{$1=""}1' input | awk '{$1=$1}1' > output
first awk will delete field 1, but leaves a delimiter, second awk removes the delimiter. Default output delimiter will be space, if you want to change to tab, add -vOFS="\t" to the second awk.
UPDATED
Based on your updated input the problem is the initial spaces that cut treats as multiple columns. One way to address is to remove them first before feeding to cut
sed 's/^ *//' input | cut -d" " -f2- > output
or use the awk alternative above which will work in this case as well.
#Karafka I had CSV files so I added the "," separator (you can replace with yours
cut -d"," -f2- input.csv > output.csv
Then, I used a loop to go over all files inside the directory
# files are in the directory tmp/
for f in tmp/*
do
name=`basename $f`
echo "processing file : $name"
#kepp all column excep the first one of each csv file
cut -d"," -f2- $f > new/$name
#files using the same names are stored in directory new/
done
You can use cut command with --complement option:
cut -f1 -d" " --complement input.file > output.file
This will output all columns except the first one.
As #karakfa notes, it looks like it's the leading whitespace which is causing your issues.
Here's a sed oneliner to do the job (that will account for spaces or tabs):
sed -i.bak "s|^[ \t]\+[0-9]\+[ \t]\+||" input.txt
Explanation:
-i edit existing file in place
.bak backup original file and add .bak file extension (can use whatever you like)
s substitute
| separator (easiest character to read as sed separator IMO)
^ start match at start of the line
[ \t] match space or tab
\+ match one or more times (escape required so sed does not interpret '+' literally)
[0-9] match any number 0 - 9
As noted; the input.txt file will be edited in place. The original content of input.txt will be saved as input.txt.bak. Use just -i instead if you don't want a backup of the original file.
Also, if you know that they are definitely leading spaces (not tabs), you could shorten it to this:
sed -i.bak "s|^ \+[0-9]\+[ \t]\+||" input.txt
You can also achieve this with grep:
grep -E -o '[[:digit:]]([[:space:]][[:digit:]]){3}$' input.txt
Which assumes single character digit and space columns. To accommodate a variable number of spaces and digits you can do:
grep -E -o '[[:digit:]]+([[:space:]]+[[:digit:]]+){3}$' input.txt
If your grep supports the -P flag (--perl-regexp) you can do:
grep -P -o '\d+(\s+\d+){3}$' input.txt
And here are a few options if you are using GNU sed:
sed 's/^\s\+\w\+\s\+//' input.txt
sed 's/^\s\+\S\+\s\+//' input.txt
sed 's/^\s\+[0-9]\+\s\+//' input.txt
sed 's/^\s\+[[:digit:]]\+\s\+//' input.txt
Note that the grep regexes are matching the parts that we want to keep while the sed regexes are matching the parts we want to remove.

AWK--Comparing the value of two variables in two different files

I have two text files A.txt and B.txt. Each line of A.txt
A.txt
100
222
398
B.txt
1 2 103 2
4 5 1026 74
7 8 209 55
10 11 122 78
What I am looking for is something like this:
for each line of A
search B;
if (the value of third column in a line of B - the value of the variable in A > 10)
print that line of B;
Any awk for doing that??
How about something like this,
I had some troubles understanding your question, but maybe this will give you some pointers,
#!/bin/bash
# Read intresting values from file2 into an array,
for line in $(cat 2.txt | awk '{print $3}')
do
arr+=($line)
done
# Linecounter,
linenr=0
# Loop through every line in file 1,
for val in $(cat 1.txt)
do
# Increment linecounter,
((linenr++))
# Loop through every element in the array (containing values from 3 colum from file2)
for el in "${!arr[#]}";
do
# If that value - the value from file 1 is bigger than 10, print values
if [[ $((${arr[$el]} - $val )) -gt 10 ]]
then
sed -n "$(($el+1))p" 2.txt
# echo "Value ${arr[$el]} (on line $(($el+1)) from 2.txt) - $val (on line $linenr from 1.txt) equals $((${arr[$el]} - $val )) and is hence bigger than 10"
fi
done
done
Note,
This is a quick and dirty thing, there is room for improvements. But I think it'll do the job.
Use awk like this:
cat f1
1
4
9
16
cat f2
2 4 10 8
3 9 20 8
5 1 15 8
7 0 30 8
awk 'FNR==NR{a[NR]=$1;next} $3-a[FNR] < 10' f1 f2
2 4 10 8
5 1 15 8
UPDATE: Based on OP's edited question:
awk 'FNR==NR{a[NR]=$1;next} {for (i in a) if ($3-a[i] > 10) print}'
and see how simple awk based solution is as compared to nested for loops.

Cutting Element in Unix Based on Column Value

Without a shell script, in a single line. What command can help you cut from a row based on the column value
For example:
In
118 Balboni,Steve 23
11 Baker,Doug 0
120 Armas,Tony 13
133 Allanson,Andy 5
158 Baines,Harold 13
33 Bando,Chris 1
44 Adduci,James 1
50 Aguayo,Luis 3
5 Allen,Rod 0
94 Anderson,Brady 1
IF there 3rd row is not zero, how do I remove the row entirely in one statement? Is this possible in unix?
Assuming that the question is really asking about 'if the third column is non-zero, do not print it' or (equivalently) 'only print the row if the third column is 0':
Using awk:
awk '$3 == 0' data
(If the third column is zero, print the input; otherwise, ignore it. You could add { print } after the 0 to make the action explicit.)
Using perl:
perl -nae 'print if $F[2] == 0' data
Using sed:
sed -n '/ 0$/p' data
Using grep:
grep '[^0-9]0$' input
This does the inplace replacement.
perl -i -F -pane 'undef $_ if($F[2]!=0)' your_file
tested:
> cat temp
118 Balboni,Steve 23
11 Baker,Doug 0
120 Armas,Tony 13
133 Allanson,Andy 5
158 Baines,Harold 13
33 Bando,Chris 1
44 Adduci,James 1
50 Aguayo,Luis 3
5 Allen,Rod 0
94 Anderson,Brady 1
>
>
> perl -i -F -pane 'undef $_ if($F[2]!=0)' temp
> cat temp
11 Baker,Doug 0
5 Allen,Rod 0
>
If you wish to print lines that have no third column as well as those in which the 3rd column is explicitly 0 (ie, if you consider a blank field to be zero), try:
awk '!$3'
If you do not want to print lines with only 2 columns, try:
awk 'NF>2 && !$3'

How to do sum from the file and move in particular way in another file in linux?

Acttualy this is my assignment.I have three-four file,related by student record.Every file have two-three student record.like this
Course Name:Opreating System
Credit: 4
123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25
243567 0 1 1 0 1 1 0 1 0 0 0 7 9 12 15 17 15
Every file have different coursename.I did every coursename and studentid move
in one file but now i don't know how to add all marks and move to another file on same place where is id? Can you please tell me how to do it?
It looks like this:
Student# Operating Systems JAVA C++ Web Programming GPA
123456 76 63 50 82 67.75
243567 80 - 34 63 59
I did like this:
#!/bin/sh
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
echo "STUDENT ID" > rsh2
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
echo "GPA" >> rsh2
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
Some comments and a few pointers :
It would help to add 'comments' for each line of code that is not self evident ; i.e. code like mv f f.bak doesn't need to be commented, but I'm not sure what the intent of your many lines of code are.
You insert a comment with the '#' char, like
# concatenate all files that contain the word CREDITS into a file called rsh1
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
Also note that you consistently use all uppercase for your search targets, i.e. CREDITS, when your sample files shows mixed case. Either used correct case for your search targets, i.e.
`grep -l 'Credits'`
OR tell grep to -i(gnore case), i.e.
`grep -il 'Credits'
Your line
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
can be reduced to 1 call to sed (and you have the same case confusion thing going on), try
sed -n '/COURSE/i{;s/COURSE NAME: //gip;}' rsh1 >> rsh2
This means (-n don't print every line by default),
`gip` = global substitute,
= ignore case in matching
print only lines where substituion was made
So you're editing out the string COURSE NAME for any line that has COURSE in it, and only printing those lines' (you probably don't need the 'g' (global) specifier given that you expect only 1 instance per line)
Your line
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
Actually looks pretty good, very advanced, you're trying to 'fold' each 2 lines together into 1 line, right?
But,
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
I'm really confused by this, is this where you're trying to total a students score? (with a sort embedded I guess not). Why do you think you need a sort,
While it is possible to perform arithmetic in sed, it is super-crazy hard, so you can either use bash variables to calculate the values OR use a unix tool that is designed to process text AND perform logical and mathematical operations of the data presented, awk or perl come to mind here
Anyway, one solution to total each score is to use awk
echo "123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25" |\
awk '{for (i=2;i<=NF;i++) { tot+=$i }; print $1 "\t" tot }'
Will give you a clue on how to proceed for that.
Awk has predefined variables that it populates for each file, and each line of text that it reads, i.e.
$0 = complete line of text (as defined by the internal variables RS (RecordSeparator)
which defaults to '\n' new-line char, the unix end-of-line char
$1 = first field in text (as defined by the internal variables FS (FieldSeparator)
which defaults to (possibly multiple) space chars OR tab char
a line with 2 connected spaces chars and 1 tab char has 3 fields)
NF = Number(of)Fields in current line of data (again fields defined by value of FS as
described above)
(there are many others, besides, $0, $n, $NF, $FS, $RS).
you can programatically increment for values like $1, $2, $3, by using a variable as in the example code, like $i (i is a variable that has a number between 2 and NF. The leading '$'
says give me the value of field i (i.e. $2, $3, $4 ...)
Incidentally, your problem could be easily solved with a single awk script, but apparently, you're supposed to learn about cat, cut, grep, etc, which is a very worthwhile goal.
I hope this helps.

Resources