Subtract a constant number from a column - linux

I have two large files (~10GB) as follows:
file1.csv
name,id,dob,year,age,score
Mike,1,2014-01-01,2016,2,20
Ellen,2, 2012-01-01,2016,4,35
.
.
file2.csv
id,course_name,course_id
1,math,101
1,physics,102
1,chemistry,103
2,math,101
2,physics,102
2,chemistry,103
.
.
I want to subtract 1 from the "id" columns of these files:
file1_updated.csv
name,id,dob,year,age,score
Mike,0,2014-01-01,2016,2,20
Ellen,1, 2012-01-01,2016,4,35
file2_updated.csv
id,course_name,course_id
0,math,101
0,physics,102
0,chemistry,103
1,math,101
1,physics,102
1,chemistry,103
I have tried awk '{print ($1 - 1) "," $0}' file2.csv, but did not get the correct result:
-1,id,course_name,course_id
0,1,math,101
0,1,physics,102
0,1,chemistry,103
1,2,math,101
1,2,physics,102
1,2,chemistry,103

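Your attempt prints a new first column in front of the unchanged record. Also, without -F"," awk splits on whitespace, so $1 is the whole line rather than the id. Instead, set the first field $1 to $1-1: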
awk -F"," 'BEGIN{OFS=","} {$1=$1-1;print $0}' file2.csv
The semicolon separates the two statements. We set the input delimiter to a comma (-F",") and the Output Field Separator to a comma (BEGIN{OFS=","}). The assignment that subtracts 1 from the first field runs first and the print runs second, so the entire record, $0, already contains the new $1 value when it is printed. Assigning to a field makes awk rebuild $0 using OFS, which is why OFS has to be set.
It might be helpful to only subtract 1 from records that are not your header. So you can add a condition to the first command:
awk -F"," 'BEGIN{OFS=","} NR>1{$1=$1-1} {print $0}' file2.csv
Now we only subtract when the record number (NR) is greater than 1, and then we print every record. For file1.csv, where id is the second column, use $2 in place of $1.
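The same idea can be applied to either file without hard-coding the field number. As a rough sketch (the function name subtract_id is made up for illustration, and it assumes the header cell is exactly "id"), the column can be looked up by its header name:

```shell
# Hypothetical helper: subtract 1 from whichever column is headed "id".
# Reads one file given as an argument, or standard input if there is none.
subtract_id() {
  awk -F',' 'BEGIN { OFS = "," }
    # Header row: remember the position of the "id" column, print unchanged.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "id") col = i; print; next }
    # Data rows: decrement that column if it was found.
    col { $col = $col - 1 }
    { print }' "$@"
}

# Example with the first rows of file1.csv from the question.
printf 'name,id,dob,year,age,score\nMike,1,2014-01-01,2016,2,20\n' | subtract_id
```

For the real files this would be called as subtract_id file1.csv > file1_updated.csv, one file at a time, since only the very first line read is treated as a header.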

Related

How to insert a column at the start of a txt file using awk?

How to insert a column at the start of a txt file running from 1 to 2059 which corresponds to the number of rows I have in my file using awk. I know the command will be something like this:
awk '{$1=" "}1' File
Not sure what to put between the speech-marks 1-2059?
I also want to include a header in the header row so 1 should only go in the second row technically.
**ID** Heading1
RQ1293939 -7.0494
RG293I32SJ -903.6868
RQ19238983 -0899977
rq747585950 988349303
FID **ID** Heading1
1 RQ1293939 -7.0494
2 RG293I32SJ -903.6868
3 RQ19238983 -0899977
4 rq747585950 988349303
So I need to insert the FID with 1 - 2059 running down the first column
What you show does not work: it just replaces the first field ($1) with a space and prints the result. If you do not have empty lines, try:
awk 'NR==1 {print "FID\t" $0; next} {print NR-1 "\t" $0}' File
Explanations:
NR is the awk variable that counts the records (the lines, in our case), starting from 1. So NR==1 is a condition that holds only when awk processes the first line. In this case the action block says to print FID, a tab (\t), the original line ($0), and then move to next line.
The second action block is executed only if the first one has not been executed (due to the final next statement). It prints NR-1, that is the line number minus one, a tab, and the original line.
If you have empty lines and you want to skip them we will need a counter variable to keep track of the current non-empty line number:
awk 'NR==1 {print "FID\t" $0; next} NF==0 {print; next} {print ++cnt "\t" $0}' File
Explanations:
NF is the awk variable that counts the fields in a record (the space-separated words, in our case). So NF==0 is a condition that holds only on empty lines (or lines that contain only spaces). In this case the action block says to print the empty line and move to the next.
The last action block is executed only if neither of the other two has been executed (due to their final next statements). It increments the cnt variable, prints it, prints a tab, and prints the original line.
Uninitialized awk variables (like cnt in our example) take value 0 when they are used for the first time as a number. ++cnt increments variable cnt before its value is used by the print command. So the first time this block is executed cnt takes value 1 before being printed. Note that cnt++ would increment after the printing.
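The difference between the two increment forms can be checked in isolation. A minimal sketch (the function name incr_demo and the variables a and b are only for illustration):

```shell
# Uninitialized awk variables start at 0 in numeric context.
# ++a increments first and then yields the new value;
# b++ yields the old value and increments afterwards.
incr_demo() {
  awk 'BEGIN { print ++a; print b++; print b }'
}

incr_demo
```

This prints 1, 0 and 1 on separate lines, which is why ++cnt rather than cnt++ gives a numbering that starts at 1.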
Assuming you don't really have a blank row between your header line and the rest of your data:
awk '{print (NR>1 ? NR-1 : "FID"), $0}' file
Use awk -v OFS='\t' '...' file if you want the output to be tab-separated or pipe it to column -t if you want it visually tabular.
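Spelled out, the tab-separated variant might look like the sketch below. The function name add_fid is hypothetical, and the sample rows are a shortened version of the data in the question:

```shell
# Hypothetical helper: prepend "FID" to the header and a running
# number to every other line, separated from the rest by a tab.
add_fid() {
  awk -v OFS='\t' '{ print (NR > 1 ? NR - 1 : "FID"), $0 }' "$@"
}

printf 'ID Heading1\nRQ1293939 -7.0494\nRG293I32SJ -903.6868\n' | add_fid
```

Only the separator between the new column and the original line becomes a tab; the original line ($0) is printed as it was.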

find records longer/shorter than a particular col

this is my file: FILEABC.txt
Name|address|age|country
john|london|12|UK
adam|newyork|39|US|X12|123
jake|madrid|45|ESP
ram|delhi
joh|cal|34|US|788
I wanted to find the header count in the file, so I have this command:
cat FILEABC.txt | awk --field-separator='|' '{print NF}' | sort -n |uniq -c
the result i get for this cmd is
1 2
3 4
1 5
1 6
My requirement is: how do I find the records that have only 2 fields, 4 fields, and so on in my file?
for ex,
if want to see the records having only 2 col:
ram|delhi
if want to see rec's having more than 4 col:
adam|newyork|39|US|X12|123
If you only want to print the records that have 2 fields, the following may help:
awk -F"|" 'NF==2' Input_file
For records with more than 4 fields, change the condition to NF>4; for more than 5 fields, use NF>5, and so on.
Explanation: -F"|" sets the field separator to a pipe. NF is a built-in awk variable holding the total number of fields in the current line, so the condition checks whether the number of fields equals 2. I have not written a print because awk works on condition/action pairs: when a condition is true and no action is given, the default action (print the current line) runs.
In awk, the variable NF gives the total number of fields in the record (row). By default awk uses whitespace as the field separator; if you change FS, NF is calculated with the separator you specify. So you can do:
awk -v FS='|' 'NF==2' infile
Which is same as
# Usual Syntax : awk 'condition { action }' infile
awk -v FS='|' 'NF==2{ print }' infile
For more than 4 fields,
awk -v FS='|' 'NF > 4' infile
You can also use grep to filter 2-column records:
grep '^[^|]*|[^|]*$' FILEABC.txt
It will output:
ram|delhi
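To list every record that is longer or shorter than the header in one pass, the header's NF can be stored and compared against each following line. A sketch under that assumption (the function name mismatched_rows and the output format are made up here):

```shell
# Hypothetical helper: report lines whose field count differs from the header.
mismatched_rows() {
  awk -F'|' '
    # Remember how many fields the header has.
    NR == 1 { want = NF; next }
    # Print line number, field count and the line itself on a mismatch.
    NF != want { print NR ": " NF " fields: " $0 }
  ' "$@"
}

printf 'Name|address|age|country\njohn|london|12|UK\nadam|newyork|39|US|X12|123\njake|madrid|45|ESP\nram|delhi\njoh|cal|34|US|788\n' | mismatched_rows
```

On the sample file this reports lines 3, 5 and 6, with 6, 2 and 5 fields respectively.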

How to cut column data from flat file

I've data in format below;
111,Ja,M,Oes,2012-08-03 16:42:00,x,xz
112,Ln,d,D,Gn,2012-08-03 16:51:00,y,yx
I need to create files with data in the sequence below:
111,x,xz
112,y,yx
The output should contain the first value (before the first comma) and the last two comma-separated values. There can be any number of commas in between.
Kindly advise how I can generate the required output file from the input file on a Linux machine.
The awk statement for this is pretty straightforward. Set the input and output field separators and print the first field along with the last two, where $NF is the value of the last column and $(NF-1) the one before it:
awk 'BEGIN{FS=OFS=","}{print $1,$(NF-1),$NF}' input.csv > newfile.csv
Not much to this one in awk:
awk -F"," 'BEGIN{OFS=","}{print $1,$(NF-1), $NF}' inFile > outFile
We split the lines in awk with a comma -F"," and then print the first field $1, the second to last field $(NF-1), and the last field $NF.
NF is the "Number of fields" so subtracting 1 from it will give you the second to last item.
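One caveat with $(NF-1): on a line with a single field it refers to $0, and on an empty line to a negative field index, which awk rejects. A guarded sketch, assuming short lines should simply be skipped (the function name first_and_last_two is hypothetical):

```shell
# Hypothetical helper: print the first and the last two comma-separated
# fields, ignoring lines that have fewer than three fields.
first_and_last_two() {
  awk 'BEGIN { FS = OFS = "," } NF >= 3 { print $1, $(NF-1), $NF }' "$@"
}

printf '111,Ja,M,Oes,2012-08-03 16:42:00,x,xz\n112,Ln,d,D,Gn,2012-08-03 16:51:00,y,yx\n113\n' | first_and_last_two
```

The third sample line (113) is invented to show the guard; it produces no output.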
with sed
$ sed -r 's/([^,]+).*(,[^,]+,[^,]+)/\1\2/' file
111,x,xz
112,y,yx
or
$ sed -r 's/([^,]+).*((,[^,]+){2})/\1\2/' file
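with awk substr (this relies on the fixed character positions in the sample and on the single space inside the timestamp)
awk '{print substr($1,1,4) substr($2,10,4)}' file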
111,x,xz
112,y,yx

comparing two files with different columns

I have two files (count.txt, count1.txt) and need to do the following:
1. Get the rows from count.txt and count1.txt where the 1st column is equal.
2. If it is equal, compare the 2nd columns: keep the row when (count.txt value + 5) <= count1.txt value.
count.txt
order1,150
order2,165
order3,125
count1.txt
order1,155
order2,170
order3,125
order4,123
and i want the output like below,
Output.txt
order1,155
order2,170
I have used the nawk command below for the 1st point, but am not able to complete the 2nd point. Please suggest how to achieve it.
nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' count.txt count1.txt
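Store the 2nd column of count.txt in the array, then print the count1.txt lines whose key exists and whose value is at least the stored value plus 5:
nawk -F"," 'NR==FNR {a[$1]=$2;next} ($1 in a) && (a[$1]+5)<=$2' count.txt count1.txt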
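For readers unfamiliar with the NR==FNR idiom, here is a commented sketch of the same logic. It uses plain awk instead of nawk, the function name compare_counts and the array name base are made up, and it assumes the first file is not empty (otherwise NR==FNR would also hold for the second file):

```shell
# Hypothetical helper: $1 = baseline file, $2 = file to filter.
compare_counts() {
  awk -F',' '
    # NR == FNR is true only while reading the first file:
    # remember its count for each order id.
    NR == FNR { base[$1] = $2; next }
    # Second file: keep rows whose id was seen and whose count
    # is at least 5 higher than the stored one.
    ($1 in base) && (base[$1] + 5 <= $2)
  ' "$1" "$2"
}

# Example with the data from the question, written to a temporary directory.
tmp=$(mktemp -d)
printf 'order1,150\norder2,165\norder3,125\n' > "$tmp/count.txt"
printf 'order1,155\norder2,170\norder3,125\norder4,123\n' > "$tmp/count1.txt"
compare_counts "$tmp/count.txt" "$tmp/count1.txt"
rm -r "$tmp"
```

With the sample data this prints order1,155 and order2,170: order3 fails the numeric test and order4 has no match in count.txt.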

Separate comma delimited cells to new rows with shell script

I have a table with comma delimited columns and I want to separate the comma delimited values in my specified column to new rows. For example, the given table is
Name Start Name2
A 1,2 X,a
B 5 Y,b
C 6,7,8 Z,c
And I need to separate the comma delimited values in column 2 to get the table below
Name Start Name2
A 1 X,a
A 2 X,a
B 5 Y,b
C 6 Z,c
C 7 Z,c
C 8 Z,c
I am wondering if there is any solution with shell script, so that I can create a workflow pipe.
Note: the original table may contain more than 3 columns.
Assuming the format of your input and output does not change:
awk 'BEGIN{FS="[ ,]"} {print $1, $2, $NF; print $1, $3, $NF}' input_file
Input:
input_file:
A 1,2 X
B 5,6 Y
Output:
A 1 X
A 2 X
B 5 Y
B 6 Y
Explanation:
awk: invoke awk, a tool for manipulating lines (records) and fields
'...': content enclosed by single-quotes are supplied to awk as instructions
BEGIN{FS="[ ,]"}: before reading any lines, tell awk to use both space and comma as delimiters; FS stands for Field Separator.
{print $1, $2, $NF; print $1, $3, $NF}: For each input line read, print the 1st, 2nd and last field on one line, and then print the 1st, 3rd, and last field on the next line. NF stands for Number of Fields, so $NF is the last field.
input_file: supply the name of the input file to awk as an argument.
In response to updated input format:
awk 'BEGIN{FS="[ ,]"} {print $1, $2, $4","$5; print $1, $3, $4","$5}' input_file
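The one-liners above assume exactly two values in the column. A more general sketch uses awk's split() on the chosen column, so any number of values and any number of columns are handled. The function name split_column is hypothetical, and it assumes whitespace-separated columns that may be re-joined with single spaces:

```shell
# Hypothetical helper: $1 = column number, remaining arguments = input
# files (standard input if none are given).
split_column() {
  col=$1
  shift
  awk -v col="$col" '{
    # Break the chosen column into its comma-separated parts.
    n = split($col, parts, ",")
    # Print one copy of the line per part, with the column replaced.
    for (i = 1; i <= n; i++) { $col = parts[i]; print }
  }' "$@"
}

printf 'Name Start Name2\nA 1,2 X,a\nB 5 Y,b\nC 6,7,8 Z,c\n' | split_column 2
```

Because a field is assigned, awk rebuilds each line with single spaces between columns; lines where the chosen column is empty or missing are dropped.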
After Runner's modification of the original question another approach might look like this:
#!/bin/sh
# Usage $0 <file> <column>
#
FILE="${1}"
COL="${2}"
# tokens separated by linebreaks
IFS="
"
for LINE in `cat ${FILE}`; do
    # get number of columns
    COLS="`echo ${LINE} | awk '{print NF}'`"
    # get actual field by COL, this contains the keys to be split into individual lines
    # replace comma with newline to "reuse" newline field separator in IFS
    KEYS="`echo ${LINE} | cut -d' ' -f${COL}-${COL} | tr ',' '\n'`"
    COLB=$(( ${COL} - 1 ))
    COLA=$(( ${COL} + 1 ))
    # get text from columns before and after actual field
    if [ ${COLB} -gt 0 ]; then
        BEFORE="`echo ${LINE} | cut -d' ' -f1-${COLB}` "
    else
        BEFORE=""
    fi
    AFTER=" `echo ${LINE} | cut -d' ' -f${COLA}-`"
    # echo "-A: $COLA ($AFTER) | B: $COLB ($BEFORE)-"
    # iterate keys and re-build original line
    for KEY in ${KEYS}; do
        echo "${BEFORE}${KEY}${AFTER}"
    done
done
This shell script should do what you want. The call below splits column 2 into multiple lines.
./script.sh input.txt 2
If you'd like to pass input through standard input using pipes (e.g. to split multiple columns in one go), you could replace the FILE="${1}" assignment with:
if [ "${1}" = "-" ]; then
FILE="/dev/stdin"
else
FILE="${1}"
fi
And run it this way:
./script.sh input.txt 1 | ./script.sh - 2 | ./script.sh - 3
Note that cut is very sensitive about the field separators. So if the line starts with a space character, column 1 would be "" (empty). If the fields were separated by a mixture of spaces and tabs, this script would have other issues too. In this case (as explained above), filtering the input (so that fields are separated by exactly one space character) should do it. If this is not possible, or the data in each column contains space characters too, the script might get more complicated.
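One possible way to do that filtering is a small pre-processing step in front of the script. This is a sketch under the assumption that tabs and runs of spaces carry no meaning of their own; the function name normalize_spaces is made up:

```shell
# Hypothetical filter: turn tabs into spaces, squeeze runs of spaces
# into one, and strip a leading or trailing space from each line.
normalize_spaces() {
  tr '\t' ' ' | tr -s ' ' | sed 's/^ //; s/ $//'
}

# After normalizing, cut sees exactly one space between fields.
printf '  A \t 1,2   X\n' | normalize_spaces | cut -d' ' -f2
```

The script could then be fed through a pipe, for example normalize_spaces < input.txt | ./script.sh - 2, provided the standard-input variant of the script is used.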
