Loop through each column in a CSV file and export distinct values to a file - linux

I have a CSV file with columns A-O. 500k rows. In Bash I would like to loop through each column, get distinct values and output them to a file:
sort -k1 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f1 -d , | uniq > EMPLOYEEID.csv
sort -k2 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f2 -d , | uniq > SORTNAME.csv
This works, but it is very manual and does not really scale if there were, say, 100 columns.
The code sorts the file in place on the given column, then the specified column is extracted with cut, passed to uniq to get the distinct values, and written out to its own file.
NB: The first row has the header information.
The above code works, but I'm looking to streamline it somewhat.

Assuming headers can be used as file names for each column:
head -1 test.csv | \
  tr "," "\n" | \
  sed "s/ /_/g" | \
  nl -ba -s$'\t' | \
  while IFS=$'\t' read field name; do
    cut -f$field -d',' test.csv | \
      tail -n +2 | sort -u > "${name}.csv"
  done
Explanation:
head - reads the first line
tr - replaces the , with a newline
sed - replaces whitespace with _ for cleaner file names (tr would also work, and you could then combine it with the previous step, but sed is handy if you need more complex transforms)
nl - adds the field number
-ba - number all lines
-s$'\t' - sets the separator to a tab (not strictly necessary, since it is the default, but included for clarity's sake)
while - reads through the field numbers/names
cut - selects the field
tail - removes the header line; not all tail implementations have this option, so you can replace it with sed '1d'
sort -u - sorts and removes duplicates
>"$name.csv" - saves in the appropriate file name
note: this assumes that there are no , int the fields, otherwise you will need to use a csv parser
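For example, a quoted field with an embedded comma splits in the wrong place:
$ echo 'id,"Smith, John",42' | cut -f2 -d','
"Smith
A CSV-aware tool (gawk's FPAT, or a scripting language's csv module) is the safer choice for such data.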

Doing all the columns in a single pass is much more efficient than rescanning the entire input file for each column.
awk -F , 'NR==1 { ncols = split($0, cols, /,/); next }
          { for (i = 1; i <= ncols; ++i)
                if (!seen[i ":" $i]++)
                    print $i >> (cols[i] ".csv") }' CROWN.csv
If this is going to be part of a bigger task, maybe split the input file into several temporary files with fewer columns than the number of open file handles permitted on your system, rather than fix this script to handle an arbitrary number of columns.
You can inspect this system constant with ulimit -n; on some systems, you can increase it either by tweaking the system configuration or, in the worst case, by recompiling the kernel. (Your question doesn't identify your platform, but this should be easy enough to google.)
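A middle ground, if you can accept slower output, is to close() each output file after every write so awk never holds more than a handful of descriptors open at once; a rough sketch of the same logic with the explicit close:
awk -F , 'NR==1 { ncols = split($0, cols, /,/); next }
          { for (i = 1; i <= ncols; ++i)
                if (!seen[i ":" $i]++) {
                    fname = cols[i] ".csv"
                    print $i >> (fname)   # append so earlier values survive the close/reopen
                    close(fname)
                } }' CROWN.csv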
Addendum: I created a quick and dirty timing comparison of these answers at https://ideone.com/dnFj41; I encourage you to fork it and experiment with different shapes of input data. With an input file of 100 columns and (probably) no duplication in the columns -- but only a few hundred rows -- I got the following results:
0.001s Baseline test -- simply copy input file to an identical output file
0.242s tripleee -- this single-pass AWK script
0.561s Sorin -- multiple passes using simple shell script
2.154s Mihir -- multiple passes using AWK
Unfortunately, Carmen's answer could not be tested, because I did not have permissions to install Text::CSV_XS on Ideone.
An earlier version of this answer contained a Python attempt, but I was too lazy to finish debugging it. It's still there in the edit history if you are curious.

Related

How to display number of times a word repeated after a common pattern

I have a file which has N number of lines.
For example:
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
How can I get the result below using the uniq command:
This/is/workshop/ =5
Okay, so there are a couple of tools you can use here. Familiarize yourself with grep, cut, and uniq. My process for doing something like this may not be ideal, but given your original question I'll try to tailor the process to the lines in the file you've given.
First you'll want to grep the file for the relevant strings. Then you can pass the result through cut, selecting the fields you want to keep by specifying the delimiter and the field numbers. Lastly, you can pipe this through to uniq to count it.
Example:
Contents of file.txt
This/is/workshop/1
This/is/workshop/2
This/is/workshop/3
This/is/workshop/4
This/is/workshop/5
Use grep, cut and uniq
$ grep "This/is/workshop/" file.txt | cut -d/ -f1-3 | uniq -c
5 This/is/workshop
To specify the delimiter in cut, you use the -d flag and the delimiter you want to use. Each field is what exists between delimiters, starting at 1. For this, we want the first three. Then just pipe it through to uniq to get the count you are after.
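One caveat: uniq only collapses adjacent duplicate lines, so if the prefixes are not already grouped together in your file, sort before counting:
$ cut -d/ -f1-3 file.txt | sort | uniq -c
5 This/is/workshop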

What do back brackets do in this bash script code?

So I'm doing a problem with a bash script, this one: ./namefreq.sh ANA should return a list of two names (on separate lines), ANA and RENEE, both of which have frequency 0.120.
Basically I have a file, table.csv, used in the code below, which has names with a frequency number next to each, e.g. Anna,0.120.
I'm still unsure what the backticks (``) do in this code, and I'm also struggling to understand how it is able to print out two names with identical frequencies. The way I read the code is:
grep matches the word (-w) typed by the user (./bashscript.sh Anna) and feeds the value of (a); cut extracts the 2nd field of that line, separated by the delimiter ",", which is the frequency from the file table.csv; the second grep then finds every line with that frequency, and | cut -f1 -d"," prints out the first fields, which are the names sharing that frequency.
^ would this be correct?
thanks :)
#!/bin/bash
a=`grep -w $1 table.csv | cut -f2 -d','`
grep -w $a table.csv | cut -f1 -d',' | sort -d
When a command is in backticks or $(), the output of the command is substituted back into the command line in its place. So if the file has Anna,0.120
a=`grep -w Anna table.csv | cut -f2 -d','`
will execute the grep and cut commands, which will output 0.120, so it will be equivalent to
a=0.120
Then the command looks for all the lines that match 0.120, extracts the first field with cut, and sorts them.
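For what it's worth, a minimal rewrite of the same script with the more modern $() form and quoted expansions (safer if an argument or field ever contains spaces) would look like this:
#!/bin/bash
# same logic as the original, with $(...) instead of backticks
a=$(grep -w "$1" table.csv | cut -f2 -d',')
grep -w "$a" table.csv | cut -f1 -d',' | sort -d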

Check record length for fixed-width files

In a Unix environment, I occasionally have some fixed-width files for which I'd like to check the record lengths. For each file I'd like to catch any records that are not the appropriate length, for further investigation; the appropriate length is known a priori.
If I want to check if all record lengths are the same, I simply run
zcat <gzipped file> | awk '{print length}' | sort -u
If there is more than one record length in the above command, then I run
zcat <gzipped file> | awk '{print length}' | nl -n rz -s "," > recordLengths.csv
which stores the record length for each row in the original file.
What: Is this an efficient method, or is there a better way of checking record length for a file?
Why: The reason I ask is that some of these files can be a few GB in size while gzipped. So this process can take a while.
With pure awk:
zcat <gzipped file> | awk '{printf "%06d,%s\n", NR, length}' > recordLengths.csv
This way you will save one extra subprocess.
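If you already know the expected width, you can also report only the offending records in the same pass; here 80 and badRecords.csv are just placeholders for your expected length and output file:
zcat <gzipped file> | awk -v expected=80 'length($0) != expected { printf "%06d,%d\n", NR, length($0) }' > badRecords.csv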

Find files from a folder when a specific word appears on at least a specific number of lines

How can I find the files from a folder where a specific word appears on more than 3 lines? I tried using recursive grep for finding that word and then using -c to count the number of lines where the word appears.
This command will recursively list the files in the current directory where word appears on more than 3 lines, along with the matches count for each file:
grep -c -r 'word' . | grep -v -e ':[0123]$' | sort -n -t: -k2
The final sort is not necessary if you don't want the results sorted, but I'd say it's convenient.
The first command in the pipeline (grep -c -r 'word' .) recursively finds every file in the current directory that contains word, and counts the occurrences for each file. The intermediate grep discards every count that is 0, 1, 2 or 3, so you just get counts greater than 3 (this is because -v in grep(1) inverts the sense of matching to select non-matching lines). The final sort step sorts the list according to the occurrences for each file; it sets the field delimiter to : and instructs sort(1) to do a numeric-based sorting using the 2nd field (the count) as the sort key.
Here's a sample output from some tests I ran:
./file1:4
./dir1/dir2/file3:5
./dir1/file2:8
If you just want the filenames without the match counts, you can use sed(1) to discard the :count portions:
grep -m 4 -c -r 'word' . | grep -v -e ':[0123]$' | sed -r 's/:[0-9]+$//'
As noted in the comments, if the match count is not important, we can optimize the first grep with -m 4, which stops reading each file after 4 matching lines.
UPDATE
The solution above works fine to a certain extent with small numbers, but it does not scale well to larger thresholds. If you want to filter based on an arbitrary number, you can use awk(1) (and in fact it ends up being much cleaner), like so:
grep -c -r 'word' . | awk -F: '$2 > 10'
The -F: argument to awk(1) is necessary; it instructs awk(1) to separate fields by : rather than the default (whitespace and tab). This solution generalizes well to any number.
Again, if matches count doesn't matter and all you want is to get a list of the filenames, do this instead:
grep -c -r 'word' . | awk -F: '$2 > 10 { print $1 }'

Comparing files off first x number of characters

I have two text files that both have data that look like this:
Mon-000101,100.27242,9.608597,11.082,10.034,0.39,I,0.39,I,31.1,31.1,,double with 1355,,,,,,,,
Mon-000171,100.2923,9.52286,14.834,14.385,0.45,I,0.45,I,33.7,33.7,,,,,,,,,,
Mon-000174,100.27621,9.563802,11.605,10.134,0.95,I,1.29,I,30.8,30.8,,,,,,,,,,
I want to compare the two files based on the Mon-000101 characters (as an example of one ID) to see where they differ. I tried some diff commands that I found in another question, which didn't work. I'm out of ideas, so I'm turning to anybody with more experience than myself.
Thanks.
HazMatt:Desktop m$ diff NGC2264_classI_h7_notes.csv /Users/m/Downloads/allfitclassesI.txt
1c1
Mon-000399,100.25794,9.877631,12.732,12.579,0.94,I,-1.13,I,9.8,9.8,,"10,000dn vs 600dn brighter source at 6 to 12"" Mon-000402,100.27347,9.59Mon-146053,100.23425,9.571719,12.765,11.39,1.11,I,1.04,I,16.8,16.8,,"double 3"" confused with 411, appears brighter",,,,,,,,
\ No newline at end of file
---
Mon-146599 Mon-146599 4.54 I 4.54 III
\ No newline at end of file
This was my attempt and the output. The thing is that I know the files differ by eleven lines... corresponding to eleven mismatched values. I don't want to do this by hand (who would?). Maybe I'm misreading the diff output, but I'd expect more than this.
Have you tried:
diff <(grep Mon-00010 file_1) <(grep Mon-00010 file_2)
First sort both the files and then try using diff
sort file1 > file1_sorted
sort file2 > file2_sorted
diff file1_sorted file2_sorted
Sorting helps arrange both files by the leading ID field, so that you don't get unwanted mismatches.
I am not sure what you are searching for, but I'll try to help. Otherwise, you could give some examples of input files and the desired output.
My input-files are:
prompt> cat in1.txt
Mon-000101,100.27242,9.608597,11.082,10.034,0.39,I,0.39,I,31.1,31.1,,double with 1355,,,,,,,,
Mon-000171,100.2923,9.52286,14.834,14.385,0.45,I,0.45,I,33.7,33.7,,,,,,,,,,
Mon-000174,100.27621,9.563802,11.605,10.134,0.95,I,1.29,I,30.8,30.8,,,,,,,,,
and
prompt> cat in2.txt
Mon-000101,111.27242,9.608597,11.082,10.034,0.39,I,0.39,I,31.1,31.1,,double with 1355,,,,,,,,
Mon-000172,100.2923,9.52286,14.834,14.385,0.45,I,0.45,I,33.7,33.7,,,,,,,,,,
Mon-000174,122.27621,9.563802,11.605,10.134,0.95,I,1.29,I,30.8,30.8,,,,,,,,,,
If you are just interested in the "ID" (whatever that means) you have to separate it out. I assume the ID is the tag before the first comma, so it is possible to cut away everything except the ID and compare:
prompt> diff <(cut -d',' -f1 in1.txt) <(cut -d',' -f1 in2.txt)
2c2
< Mon-000171
---
> Mon-000172
If the ID is more complicated, you can extract it with grep using regular expressions.
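For example, if the ID could not simply be cut at the first comma, something like this (the pattern is just illustrative) would extract it instead:
prompt> diff <(grep -oE '^Mon-[0-9]+' in1.txt) <(grep -oE '^Mon-[0-9]+' in2.txt)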
Additionally, diff -y gives you a side-by-side view showing which lines differ. You can use it to compare the complete files, or combine it with the cut shown before:
prompt> diff -y <(cut -d',' -f1 in1.txt) <(cut -d',' -f1 in2.txt)
Mon-000101 Mon-000101
Mon-000171 | Mon-000172
Mon-000174 Mon-000174
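And if you only want to know which IDs exist in one file but not the other, comm on the sorted ID lists does it directly (IDs unique to in1.txt in the first column, IDs unique to in2.txt indented in the second):
prompt> comm -3 <(cut -d',' -f1 in1.txt | sort) <(cut -d',' -f1 in2.txt | sort)
Mon-000171
	Mon-000172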
