Check record length for fixed width files - linux

In a Unix environment, I occasionally have some fixed-width files for which I'd like to check the record lengths. For each file I'd like to flag any records that are not the appropriate length for further investigation; the appropriate length is known a priori.
If I want to check if all record lengths are the same, I simply run
zcat <gzipped file> | awk '{print length}' | sort -u
If there is more than one record length in the above command, then I run
zcat <gzipped file> | awk '{print length}' | nl -n rz -s "," > recordLengths.csv
which stores the record length for each row in the original file.
What: Is this an efficient method, or is there a better way of checking record length for a file?
Why: The reason I ask is that some of these files can be a few GB in size while gzipped. So this process can take a while.

With pure awk:
zcat <gzipped file> | awk '{printf "%0.6d,%s\n", NR, length}' > recordLengths.csv
This way you save the extra nl subprocess.
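If the expected width is known up front, the same idea can be pushed one step further so only the offending records are written out; a minimal sketch, assuming a hypothetical expected width of 80 and a hypothetical output file name:
zcat <gzipped file> | awk -v want=80 'length($0) != want { printf "%0.6d,%d\n", NR, length($0) }' > badRecordLengths.csv
This keeps the single-pass, single-awk structure, but the output file only contains the rows that need investigation.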

Related

How to clean output and print the desired information with less CPU usage

I have a 20 GB log file that contains lots of fields; column 2 contains numbers. I use the command below to print only column 2:
zcat /path to file location/$date*/logfile_*.dat.zip | awk '/Read:ROP/' | nawk -F "=" '{print $2}'
The result of this command is:
"93711994166", Key
Since I want only the number, I append the following to my original command to clean the output:
| awk -F, '{print $1}' | sed 's/"//g'
The result is:
93711994166
My final purpose is to print only numbers whose length is other than 11 digits, so I append the following to my command:
| grep -vE '^.{11}$'
So my final command is:
zcat /path to file location/$date*/logfile_*.dat.zip | awk '/Read:ROP/' | nawk -F "=" '{print $2}' | awk -F, '{print $1}' | sed 's/"//g' | grep -vE '^.{11}$' >/tmp/$file
This command takes a long time to execute and causes high CPU usage. I want to achieve the following:
print all numbers with length not equal to 11 digits
print all numbers that do not start with 93 (regardless of their length)
a clean, effective command that is not CPU or memory costly
Note:
the log file contains lots of different lines, but I use awk '/Read:ROP/' to work on the output below and extract the numbers:
Read:ROP (CustomerId="93700001865", Key=1, ActiveEndDate=2025-01-19 20:12:22, FirstCallDate=2018-01-08 12:30:30, IsFirstCallPassed=true, IsLocked=false, LTH={Data=["1|MOC|07.07.2020 09:18:58|48000.0|119||OnPeakAccountID|480|19250||", "1|RECHARGE|04.07.2020 10:18:32|-4500.0|0|0", "1|RECHARGE|04.07.2020 10:18:59|-4500.0|0|0"], Index=0}, LanguageID=2, LastKnownPeriod="Active", LastRechargeAmount=4500, LastRechargeDate=2020-07-04 10:18:59, VoucherRchFraudCounter=0, c_BlockPAYG=true, s_PackageKeyCounter=13, s_OfferId="xyz", OnPeakAccountID_FU={Balance=18850});
20GB log file [...] zcat
Using zcat on 20GB log files is quite expensive. Check top when running your command line above.
It might be worth keeping the data from the first filtering step:
zcat /path to file location/$date*/logfile_*.dat.zip | awk '/Read:ROP/' > filter_data.out
and work with the filtered data. I assume here that this awk step can remove the majority of the data.
Bonus points: This step can be parallelized by running the zcat [...] | awk [...] pipe file-by-file, and you only need to do this once for each file.
The other steps don't look particularly expensive unless there are a lot of data lines left even after filtering.
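If the filtered data is still large, the remaining awk/nawk/sed/grep steps could probably be collapsed into one awk program as well, so each line is touched only once; a sketch, assuming the number you want is always the first quoted value on a Read:ROP line and that you want to keep anything that is not an 11-digit number starting with 93:
zcat /path to file location/$date*/logfile_*.dat.zip |
awk -F'"' '/Read:ROP/ {
    n = $2                              # first quoted value, i.e. the CustomerId
    if (length(n) != 11 || n !~ /^93/)  # not 11 digits, or does not start with 93
        print n
}' > /tmp/$file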
sed '/.*Read:ROP.*([^=]*="\([^"]*\)".*/!d; s//\1/'
/.../ - match regex
.*Read:ROP.* - match Read:ROP with anything in front and anything after, i.e. the equivalent of awk '/Read:ROP/'
([^=]*=" - match a (, followed by anything except =, then a =, then a ", i.e. nawk -F "=" '{print $2}'
\([^"]*\) - match everything inside the quotes. I guess [0-9]* would be fine too
".* - match the rest of the line (which the substitution then discards)
! - if the line doesn't match the regex
d - remove the line
s - substitute
// - reuse the regex in /.../
\1 - replace the match with the first backreference, i.e. what \([^"]*\) captured
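Plugged into the original pipeline, that could look like the following sketch; the trailing grep drops only the 11-digit numbers that start with 93, which keeps both of the requested sets:
zcat /path to file location/$date*/logfile_*.dat.zip |
sed '/.*Read:ROP.*([^=]*="\([^"]*\)".*/!d; s//\1/' |
grep -vE '^93.{9}$' > /tmp/$file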

Loop through each column in a CSV file and export distinct values to a file

I have a CSV file with columns A-O. 500k rows. In Bash I would like to loop through each column, get distinct values and output them to a file:
sort -k1 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f1 -d , | uniq > EMPLOYEEID.csv
sort -k2 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f2 -d , | uniq > SORTNAME.csv
This works, but to me is very manual and not really scalable if there were like 100 columns.
The code sorts the file in place on the given column, then that column is cut out and passed to uniq to get the distinct values, which are then written to a file.
NB: The first row has the header information.
The above code works, but I'm looking to streamline it somewhat.
Assuming headers can be used as file names for each column:
head -1 test.csv | \
tr "," "\n" | \
sed "s/ /_/g" | \
nl -ba -s$'\t' | \
while IFS=$'\t' read field name; do
cut -f$field -d',' test.csv | \
tail -n +2 | sort -u > "${name}.csv" ;
done
Explanation:
head - reads the first line
tr - replaces the , with a newline
sed - replaces whitespace with _ for cleaner file names (tr would work too, and could be combined with the previous step, but use sed if you need more complex transforms)
nl - adds the field number
-ba - number all lines
-s$'\t' - set the separator to tab (not necessary, as it is the default, but included for clarity)
while - reads through the field number/name pairs
cut - selects the field
tail - removes the header; not all tails have this option, you can replace it with sed
sort -u - sorts and removes duplicates
> "${name}.csv" - saves to the appropriately named file
Note: this assumes that there are no , characters inside the fields; otherwise you will need a CSV parser (see the sketch below).
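If the fields can contain commas, one way to keep the same loop is to swap cut for a CSV-aware tool; a sketch using csvcut from the csvkit package (assuming it is installed, which it may not be by default):
head -1 test.csv | tr "," "\n" | sed "s/ /_/g" | nl -ba -s$'\t' |
while IFS=$'\t' read field name; do
    # csvcut understands quoting, so embedded commas do not break the column selection
    csvcut -c $field test.csv | tail -n +2 | sort -u > "${name}.csv"
done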
Doing all the columns in a single pass is much more efficient than rescanning the entire input file for each column.
awk -F , 'NR==1 { ncols = split($0, cols, /,/); next }
  { for (i=1; i<=ncols; ++i)
      if (!seen[i ":" $i]++)
          print $i >> (cols[i] ".csv") }' CROWN.csv
If this is going to be part of a bigger task, maybe split the input file into several temporary files with fewer columns than the number of open file handles permitted on your system, rather than fix this script to handle an arbitrary number of columns.
You can inspect this system constant with ulimit -n; on some systems, you can increase it either by tweaking the system configuration or, in the worst case, by recompiling the kernel. (Your question doesn't identify your platform, but this should be easy enough to google.)
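For reference, checking the current limit and raising it for the current shell session (where the hard limit allows) looks like this:
ulimit -n         # print the current soft limit on open file descriptors
ulimit -n 4096    # raise the soft limit for this shell session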
Addendum: I created a quick and dirty timing comparison of these answers at https://ideone.com/dnFj41; I encourage you to fork it and experiment with different shapes of input data. With an input file of 100 columns and (probably) no duplication in the columns -- but only a few hundred rows -- I got the following results:
0.001s Baseline test -- simply copy input file to an identical output file
0.242s tripleee -- this single-pass AWK script
0.561s Sorin -- multiple passes using simple shell script
2.154s Mihir -- multiple passes using AWK
Unfortunately, Carmen's answer could not be tested, because I did not have permissions to install Text::CSV_XS on Ideone.
An earlier version of this answer contained a Python attempt, but I was too lazy to finish debugging it. It's still there in the edit history if you are curious.

zcat file not working for gzip file

I have a .gz file which I need to merge and do other manipulations with (without decompressing it), but I am having trouble just using zcat or gzip -dc or awk; for example, when I pass the output to less -S like this:
awk '{print $1}' <(gzip -dc file.gz) | less -S
I get the wrong column printed. When I just use less -S to view the file, only the last few columns are shown. So I thought it was a problem with the delimiter, but I have tried importing a few lines into R (the file is too big to import whole), and it seems to be space-delimited, since all the columns show up when I do this:
x=read.table("file.gz", header=T, nrows=100)
But how do I read the lines correctly to use this file with zcat?
Thank you so much for your help!
If you want the whole line to be printed, try $0.
awk '{print $0}' <(gzip -dc file.gz) | less -S
If you want specific columns to be printed, use -F to specify the field separator. For example, if you want the first field of ':'-separated fields from each line (like in /etc/passwd), try this command:
awk -F':' '{print $1}' <(gzip -dc passwd.gz) |less -S
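If the columns still come out wrong even with awk's default whitespace splitting, it may be worth inspecting the first decompressed line for carriage returns or other non-printing characters; a quick diagnostic sketch using GNU cat:
gzip -dc file.gz | head -n 1 | cat -A | less -S
A trailing ^M, for example, would indicate Windows line endings, which can confuse both display and field splitting.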

Sed, Awk for combining the output of two cut statements

I'm trying to combine the outputs below into one command. The issue is that the fields I'm trying to grab are in reverse order. I was told that cut doesn't support a "reverse" option and to use awk for this purpose, but it didn't end up working for me. I'm trying to take the output of ls -l against /dev/block to return the partitions and automatically build a dd if= / of= command for each output line.
I tried piping the output to awk:
cut -d' ' -f23,25 ... | awk '{print $2,$1}'
however, when I then used sed to add the prefix and suffix, the result wasn't in the appropriate order.
I built the two statements below, which individually return the expected output; I'm just looking for the "right" way to combine them in the most efficient manner using sed/awk.
ls -l /dev/block/platform/msm_sdcc.1/by-name/ | cut -d' ' -f 25 | sed "s/^/dd if=/"
ls -l /dev/block/platform/msm_sdcc.1/by-name/ | cut -d' ' -f 23 | sed "s/.*/of=\/external_sd\/&.dsk/"
Any assistance will be appreciated.
Thank you.
If you're already using awk, I don't think you'll need cut or sed. You can probably do something like the following, though I'll have to trust you on the field numbers
ls -l /dev/block/platform/msm_sdcc.1/by-name | awk '{print "dd if=/"$25 " of=/" $23 ".dsk"}'
awk will split on all whitespace, not just the space character, so it's possible the fields will shift some, though it may be more reliable too.
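Since ls -l prints each symlink as name -> target, another option is to count fields from the end of the line instead of relying on absolute column numbers; a sketch along those lines (the /external_sd/ output path is taken from your second command):
ls -l /dev/block/platform/msm_sdcc.1/by-name | awk '$(NF-1) == "->" {
    # $NF is the symlink target (the real block device), $(NF-2) the by-name entry
    print "dd if=" $NF " of=/external_sd/" $(NF-2) ".dsk"
}'
The $(NF-1) == "->" test also skips the "total" line and anything that is not a symlink.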

How do I randomly merge two input files to one output file using unix tools?

I have two text files, of different sizes, which I would like to merge into one file, but with the content mixed randomly; this is to create some realistic data for some unit tests. One text file contains the true cases, while the other the false.
I would like to use standard Unix tools to create the merged output. How can I do this?
Random sort using -R:
$ sort -R file1 file2 -o file3
My version of sort does not support -R either. Here is an alternative using awk: insert a random number in front of each line, sort on those numbers, then strip the numbers off.
awk '{print int(rand()*1000), $0}' file1 file2 | sort -n | awk '{$1="";print $0}'
This adds a random number to the beginning of each line with awk, sorts based on that number, and then removes it. This will even work if you have duplicates (as pointed out by choroba) and is slightly more cross platform.
awk 'BEGIN { srand() } { print rand(), $0 }' file1 file2 |
sort -n |
cut -f2- -d" "
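If GNU coreutils is available, shuf does this directly; it shuffles its input, so the two files just need to be concatenated first:
cat file1 file2 | shuf > file3
Unlike sort -R, shuf does not group identical lines together, so duplicates stay randomly spread out.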
