How can I get the second column of a very large CSV file using a Linux command?

I was given this question during an interview. I said I could do it with Java or Python, e.g. using xreadlines() to traverse the whole file and fetch the column, but the interviewer wanted me to use just a Linux command. How can I achieve that?

You can use the awk command. Below is an example that prints the second column of a comma-separated file:
awk -F, '{print $2}' file.txt
And to store the result, redirect it into a file:
awk -F, '{print $2}' file.txt > output.txt
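For instance, with a small made-up file (the data here is just illustrative):
$ printf 'a,b,c\n1,2,3\n' > file.txt
$ awk -F, '{print $2}' file.txt
b
2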

You can use cut:
cut -d, -f2 /path/to/csv/file
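A quick check with made-up data:
$ printf 'a,b,c\n1,2,3\n' | cut -d, -f2
b
2
Note that neither cut nor awk -F, understands quoted CSV fields, so a field like "x,y" will be split at the embedded comma.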

I'd add to Andreas' answer, but I can't comment yet.
With CSV, you have to give awk a field-separator argument, or it will define fields delimited by whitespace instead of commas. (Obviously, a CSV that uses a different field separator will need a different character to be declared.)
awk -F, '{print $2}' file.txt
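To see the difference, compare the default whitespace splitting with the explicit comma separator (made-up input):
$ echo 'a,b,c' | awk '{print $2}'

$ echo 'a,b,c' | awk -F, '{print $2}'
b
Without -F, the whole line a,b,c is a single whitespace-delimited field, so $2 is empty.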

Related

Extract column(s) from a text file with a multi-character delimiter, i.e. "%$%"

I have tried different solutions given on the forum, but they don't work for the specified delimiter %$%. I need to extract one specific column from a file containing 200+ columns.
I tried the following:
awk -F"%$%" '{print $1}' sample.txt > outfile.txt
awk 'gsub("%$%",":")' sample.txt > outfile.txt
The symbol $ is a special character in a regex, so you need to escape it with a \; the backslash is itself special in the string literal, so it needs to be escaped again.
So, finally we have:
$ cat sample
ghkjlj;lk%$%23e;k32poek%$%eqdje2oijd%$%xrgtdy5h
$ awk -F'%\\$%' '{print $1}' sample
ghkjlj;lk
Whether it's -F (FS) or gsub(), awk expects a regex, so you need to either use a character class or escape the characters with special meaning, like the $ in your example.
kent$ awk -F'%[$]%' '{print $1}' <<<"foo%$%bar%$%blah"
foo
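For completeness, the double-escaped form from the answer above behaves the same way (same made-up input):
kent$ awk -F'%\\$%' '{print $2}' <<<"foo%$%bar%$%blah"
bar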
If you just want to change the separator, you can do it with gsub() or by setting OFS:
kent$ awk -F'%[$]%' -v OFS=":" '$1=$1' <<<"foo%$%bar%$%blah"
foo:bar:blah
kent$ awk 'gsub(/%[$]%/,":")+1' <<<"foo%$%bar%$%blah"
foo:bar:blah

Extract group name from one line repeatedly?

I got output from a command like the one below and need to extract the group names.
dsAttrTypeNative:memberOf: CN=Grupa_test,OU=Groups,DC=yellow,DC=com CN=Firefox_Install,OU=Groups,DC=yellow,DC=com CN=Network_Admin,OU=Groups,DC=yellow,DC=com
So I would like to have something like:
Grupa_test
Firefox_Install
Network_Admin
The number of groups will be different each time, so I'm not sure how to achieve that.
$ awk -v RS=' ' -F'[=,]' 'NR>1{print $2}' file
Grupa_test
Firefox_Install
Network_Admin
The above will work with any awk.
You can do it with GNU grep:
grep -oP '(?<=CN=)[^,]*' file
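Run against the sample line, that prints:
$ grep -oP '(?<=CN=)[^,]*' file
Grupa_test
Firefox_Install
Network_Admin
(The -P flag needs a grep built with PCRE support, which is standard for GNU grep on most Linux distributions.)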
You could also try the following awk:
awk -v RS='[ ,]' -v FS="=" '/CN=/{print $2}' Input_file
$ awk -v FPAT="CN=[^,]+" '{for(i=1;i<=NF;i++)print substr($i,4)}' Input_file
This treats every match of CN=[^,]+ as a field. For each matched field, substr($i,4) strips the leading CN= and prints the desired string.
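With GNU awk (FPAT was added in gawk 4.0), running it against the sample line saved as Input_file gives:
Grupa_test
Firefox_Install
Network_Admin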

Unix (ksh) script to read file, parse and output certain columns only

I have an input file that looks like this:
"LEVEL1","cn=APP_GROUP_ABC,ou=dept,dc=net","uid=A123456,ou=person,dc=net"
"LEVEL1","cn=APP_GROUP_DEF,ou=dept,dc=net","uid=A123456,ou=person,dc=net"
"LEVEL1","cn=APP_GROUP_ABC,ou=dept,dc=net","uid=A567890,ou=person,dc=net"
I want to read each line, parse and then output like this:
A123456,ABC
A123456,DEF
A567890,ABC
In other words, retrieve the user id from "uid=" and then the identifier from "cn=APP_GROUP_". Repeat for each input record, writing to a new output file.
Note that the column positions aren't fixed, so I can't rely on them; I'm guessing I have to search for the "uid=" string and somehow use its position?
Any help much appreciated.
You can do this easily with sed:
sed 's/.*cn=APP_GROUP_\([^,]*\).*uid=\([^,]*\).*/\2,\1/'
The regex captures the two desired strings and outputs them in reverse order with a comma between them. You might need to tighten the captures, depending on the precise nature of your data, because the greedy .* means uid= will match the last uid= on the line if there is more than one.
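Applied to the sample input (assuming it is saved as file), that gives:
$ sed 's/.*cn=APP_GROUP_\([^,]*\).*uid=\([^,]*\).*/\2,\1/' file
A123456,ABC
A123456,DEF
A567890,ABC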
You can use awk to split into columns: first split by ',', then split by '=', and grab the result. You can do it easily as awk -F, '{ print $5}' | awk -F= '{print $2}'.
Applied to the example you provided:
cat file | awk -F, '{ print $5}' | awk -F= '{print $2}'
A123456
A123456
A567890

cat passwd | awk -F':' '{printf $1}' Is this command correct?

I'd like to know how cat passwd | awk -F':' '{printf $1}' works. cat /etc/passwd lists users with IDs and home directories, from root down to the current user (I don't know if that has something to do with cat passwd). -F is some kind of input file, and {printf $1} prints the first column. That's what I've found so far, but it seems confusing to me.
Can anyone help me or explain to me whether it's right or wrong, please?
This is equivalent to awk -F: '{print $1}' passwd. The cat command is superfluous as all it does is read a file.
The -F option sets the field separator for awk. The quotes around the colon are also superfluous, since a colon is not special to the shell in this context. The print invocation tells awk to print the first field using $1. Note that you are not passing printf a format string; printf treats $1 itself as the format (so it prints no newline and will misbehave if a field contains %), which is why you probably mean print instead of printf.
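A quick way to see the difference, with made-up input:
$ echo 'a:b' | awk -F: '{print $1}'
a
$ echo 'a:b' | awk -F: '{printf $1}'
a$
The printf version emits no trailing newline, so the next shell prompt lands on the same line as the output.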

How to reverse order of fields using AWK?

I have a file with the following layout:
123,01-08-2006
124,01-09-2007
125,01-10-2009
126,01-12-2010
How can I convert it into the following by using AWK?
123,2006-08-01
124,2007-09-01
125,2009-10-01
126,2010-12-01
Didn't read the question properly the first time. You need a field separator that can be either a dash or a comma. Once you have that, you can use the dash as the output field separator (as it's the most common) and fake the comma using concatenation:
awk -F',|-' 'BEGIN{OFS="-"} {print $1 "," $4,$3,$2}' file
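Run against the sample file, this produces:
123,2006-08-01
124,2007-09-01
125,2009-10-01
126,2010-12-01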
Pure awk
awk -F"," '{ n=split($2,b,"-");$2=b[3]"-"b[2]"-"b[1];$i=$1","$2 } 1' file
sed
sed -r 's/(^.[^,]*,)([0-9]{2})-([0-9]{2})-([0-9]{4})/\1\4-\3-\2/' file
sed 's/\(^.[^,]*,\)\([0-9][0-9]\)-\([0-9][0-9]\)-\([0-9]\+\)/\1\4-\3-\2/' file
Bash
#!/bin/bash
while IFS="," read -r a b
do
    IFS="-"
    set -- $b
    echo "$a,$3-$2-$1"
done < "file"
Unfortunately, I think some awk implementations only allow a single field-separator character, so you may have to pre-process the data. You can do this with tr, but if you really want an awk-only solution, use:
pax> echo '123,01-08-2006
124,01-09-2007
125,01-10-2009
126,01-12-2010' | awk -F, '{print $1"-"$2}' | awk -F- '{print $1","$4"-"$3"-"$2}'
This outputs:
123,2006-08-01
124,2007-09-01
125,2009-10-01
126,2010-12-01
as desired.
The first awk changes the , characters to - so that you have four fields separated with the same character (this is the bit I'd usually use tr ',' '-' for).
The second awk prints them out in the order you specified, correcting the field separators at the same time.
If you're using an awk implementation that allows multiple FS characters, you can use something like:
gawk -F ',|-' '{print $1","$4"-"$3"-"$2}'
If it doesn't need to be awk, you could use Perl too:
$ perl -nle 'print "$1,$4-$3-$2" while (/(\d{3}),(\d{2})-(\d{2})-(\d{4})\s*/g)' < file.txt
