How to get CSV dimensions from the terminal - Linux

Suppose I'm in a folder where ls returns Test.csv. What command do I enter to get the number of rows and columns of Test.csv (a standard comma separated file)?

Try using awk. It is well suited to manipulating well-formatted CSV files.
awk -F, 'END {printf "Number of Rows : %s\nNumber of Columns = %s\n", NR, NF}' Test.csv
-F, specifies ',' as the field separator for the CSV file.
At the end of the file traversal, NR and NF hold the number of rows and the number of columns (NF reflects the last line read), respectively.
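A quick sanity check on a hypothetical two-row, three-column file (sample.csv and its contents are made up for illustration):
printf 'a,b,c\nd,e,f\n' > sample.csv
awk -F, 'END {printf "Number of Rows : %s\nNumber of Columns = %s\n", NR, NF}' sample.csv
# Number of Rows : 2
# Number of Columns = 3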
Another quick and dirty approach:
# Number of Rows
cat Test.csv | wc -l
# Number of Columns
head -1 Test.csv | sed 's/,/\t/g' | wc -w

Although not a native solution using GNU coreutils, it is worth mentioning (since this is one of the top Google results for such a question) that xsv provides a command to list the headers of a CSV file, whose line count is simply the number of columns.
# count rows
xsv count <filename>
# count columns
xsv headers <filename> | wc -l
For big files this is orders of magnitude faster than native solutions with awk and sed.
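For instance, on the sample.csv sketched above (note that xsv treats the first row as a header, so its record count excludes that row):
xsv count sample.csv             # 1 data record
xsv headers sample.csv | wc -l   # 3 columns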

Related

Check if any file in a directory contains words in a constant file

I have a directory with n text files. Now I want to check whether any of these files contains one (or more) of the words in a constant file.
These files are all dictionaries with different numbers of words. The constant file is a password list against which I want to check those words. The number of correct hits should be saved in a variable. The matched words should also be saved (I think as an array) in a variable.
For example: file1 contains This is my dictionary, file2 contains And another one, and my password list contains this is a test for the dictionary and we have no other one.
The hits from file1 are This is dictionary (n1=3 words) and from file2 and one (n2=2 words).
My present code is
#!/bin/bash
# program_call passwordlist.txt *.txt
passwordlist="$1"
shift                                   # the remaining arguments are the dictionaries
for comparison in "$@"; do
    cat "$passwordlist" "$comparison" | sort | uniq -d > "${comparison}.compare"
done
One of my biggest problems here is that I have a varying number of dictionaries, maybe 2, maybe 200. Regardless, all of them have to be checked against the password list, and the results (the number of correct words and the correct words themselves) have to be saved in their OWN variables. So I think two variables for each dictionary.
Another way:
$ for f in file{1,2}; do
    echo -n "$f: "
    grep -iow -f <(tr ' ' '\n' <cons) "$f" | wc -l
done
file1: 3
file2: 2
Convert the constants file to one word per line, check the dictionary files for word matches (ignoring case), and count the matched occurrences.
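Since the question also wants the count and the matched words kept in variables per dictionary, here is a minimal bash sketch building on the loop above (cons, file1 and file2 follow the example names; the array names are illustrative):
declare -A hits words                    # associative arrays keyed by file name
for f in file1 file2; do
    mapfile -t m < <(grep -iow -f <(tr ' ' '\n' <cons) "$f")
    hits["$f"]=${#m[@]}                  # number of matched words
    words["$f"]="${m[*]}"                # the matched words themselves
done
echo "${hits[file1]} hits in file1: ${words[file1]}"
This needs bash 4+ for mapfile and declare -A.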
My solution:
#!/bin/bash
# program_call_is /dictionarys/*.txt passwordlist.txt
passwordlist="${!#}"                     # the last argument is the password list
for comparison in "${@:1:$#-1}"; do      # everything before it is a dictionary
    fgrep -x -f "$passwordlist" "$comparison" > "${comparison}.compare"
done

Extracting columns from multiple files into a single output file from the command line

Say I have a tab-delimited data file with 10 columns. With awk, it's easy to extract column 7, for example, and output that into a separate file. (See this question, for example.)
What if I have 5 such data files, and I would like to extract column 7 from each of them and make a new file with 5 data columns, one for the column 7 of each input file? Can this be done from the command line with awk and other commands?
Or should I just write up a Python script to handle it?
awk '{a[FNR] = a[FNR] " " $7} END {for (i=1; i<=FNR; i++) print a[i]}' file1 file2 file3 file4 file5
The array a accumulates column 7 from each file, one entry per input line number.
FNR is the number of records read so far from the current input file; it is reset at the beginning of each file.
END {for (i=1; i<=FNR; i++) print a[i]} prints the contents of array a once all input has been read (this assumes all files have the same number of lines; the next answer handles unequal lengths).
If the data is small enough to store it all in memory then this should work:
awk '{out[FNR]=out[FNR] (out[FNR]?OFS:"") $7; max=(FNR>max)?FNR:max} END {for (i=1; i<=max; i++) {print out[i]}}' file1 file2 file3 file4 file5
If it isn't then you would need something fancier which could seek around file streams or read single lines from multiple files (a shell loop with N calls to read could do this).
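One such approach, as a hedged sketch: stream column 7 of each file with cut and let paste stitch the streams side by side, so nothing is held in memory (process substitution is a bashism; the five file names follow the question's example, and columns7.txt is an illustrative output name):
paste <(cut -f7 file1) <(cut -f7 file2) <(cut -f7 file3) \
      <(cut -f7 file4) <(cut -f7 file5) > columns7.txt
This relies on the files being tab-delimited, as the question states; paste reads all five streams line by line in lockstep.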

Word Location in a file

I was experimenting with the Linux terminal and I was wondering how to find the column number and row of a specific word.
For example:
grep -i "hello" desktop/test.file
The output was the line containing hello, however I also want it to show the column number and row.
To my understanding, grep can't do that. You would have to script something that counts the words preceding the match and outputs that position.
Here is how to do it with awk:
awk '{for (i=1; i<=NF; i++) if (tolower($i) ~ /^hello$/) print "row=" NR, "column=" i}' file
Here row is the line number (NR) and column is the field number (i); tolower() makes the match case-insensitive, like grep -i.
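If by column you mean the character offset rather than the field index, a minimal sketch using index() (the search word is case-folded to mirror grep -i; only the first occurrence per line is reported):
awk -v word="hello" '{
    pos = index(tolower($0), tolower(word))   # character position of the first match, 0 if absent
    if (pos) print "row=" NR, "char=" pos
}' desktop/test.file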

Sorting large files by [M]M/[D]D/YYYY

I have these large tab-delimited text files that I want to sort by the date field (the 17th field). The issue is that the dates come in the format [M]M/[D]D/YYYY meaning that there are no leading zeros so dates can be:
3/3/2013,
4/17/2014,
12/4/2013
Is it possible to use the sort command to do this? I haven't been able to find an example that takes into account no leading zeros.
As a note, I've tried recalculating the date field to be days from a certain date and then sorting on that. This works but the read/write necessary to do this extra step takes a long long time.
If the date is at the start of the line:
sort -n -t/ -k3,3 -k1,1 -k2,2
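For example, on the question's sample dates (assuming for the moment that each date starts its line):
printf '3/3/2013\n4/17/2014\n12/4/2013\n' | sort -n -t/ -k3,3 -k1,1 -k2,2
# 3/3/2013
# 12/4/2013
# 4/17/2014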
Use sort's --debug option, if available, to see which part of each line is actually being compared.
The following prefixes each line with YYYYMMDD before passing it to sort, then removes the added characters.
<file.in perl -pe'
    $_ = (
        m{^(?:[^\t]*\t){16}(\d+)/(\d+)/(\d+)\t}
            ? sprintf("%04d%02d%02d", $3, $1, $2)
            : " " x 8
    ) . $_;
' | sort | cut -b 9- >file.out
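The same decorate-sort-undecorate idea can be sketched in awk, assuming every line has a well-formed date in the 17th tab-separated field (unlike the Perl version, this does not guard against missing dates; file names are illustrative):
awk -F'\t' '{
    split($17, d, "/")                                   # d[1]=month, d[2]=day, d[3]=year
    printf "%04d%02d%02d\t%s\n", d[3], d[1], d[2], $0    # prepend a sortable YYYYMMDD key
}' file.in | sort -k1,1n | cut -f2- > file.out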

Chunk a large file based on regex (Linux)

I have a large text file and I want to chunk it into smaller files based on the distinct values of a column. Columns are separated by commas (it's a CSV file) and there are lots of distinct values:
e.g.
1012739937,2006-11-28,d_02245211
1012739937,2006-11-28,d_02238545
1012739937,2006-11-28,d_02236564
1012739937,2006-11-28,d_01918338
1012739937,2006-11-28,d_02148765
1012739937,2006-11-28,d_00868949
1012739937,2006-11-28,d_01908448
1012740478,1998-06-26,d_01913689
1012740478,1998-06-26,i_4869
1012740478,1998-06-26,d_02174766
I want to chunk the file into smaller files such that each file contains the records belonging to one year (one for the records of 2006, one for the records of 1998, etc.).
(Here we may have a limited number of years, but I want to do the same thing with a larger number of distinct values of a specific column.)
You can use awk:
awk -F, '{split($2,d,"-");print > d[1]}' file
Explanation:
-F, tells awk that input fields are separated by ','
split($2,d,"-") splits the second column (the date) on '-' and puts the pieces into the array 'd'
print > d[1] prints the whole input line into a file named after the year
A quick awk solution, if slightly fragile (it assumes the second column, when present, always starts with yyyy):
awk -F, '$2{print > (substr($2,1,4) ".csv")}' test.in
It will split input into files yyyy.csv; make sure they don't exist in your current directory or they will be overwritten.
A different awk take: use a slightly more complicated field separator:
awk -F '[,-]' '{print > $2}' file
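If the key column has a very large number of distinct values, some awk implementations hit the limit on simultaneously open files. A hedged workaround is to append and close each output file as you go (slower, but bounded; the .csv suffix is optional):
awk -F, '{split($2, d, "-"); f = d[1] ".csv"; print >> f; close(f)}' file
Note the >>: remove any old yyyy.csv files before re-running, or the new output will be appended to them.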
