Chunk a large file based on regex (Linux)

I have a large text file and I want to split it into smaller files based on the distinct values of a column. The columns are separated by commas (it's a CSV file) and there are lots of distinct values, e.g.:
1012739937,2006-11-28,d_02245211
1012739937,2006-11-28,d_02238545
1012739937,2006-11-28,d_02236564
1012739937,2006-11-28,d_01918338
1012739937,2006-11-28,d_02148765
1012739937,2006-11-28,d_00868949
1012739937,2006-11-28,d_01908448
1012740478,1998-06-26,d_01913689
1012740478,1998-06-26,i_4869
1012740478,1998-06-26,d_02174766
I want to chunk the file into smaller files such that each file contains the records belonging to one year (one for the records of 2006, one for the records of 1998, etc.).
(Here we may have a limited number of years, but I want to do the same thing with a larger number of distinct values of a specific column.)

You can use awk:
awk -F, '{split($2,d,"-");print > d[1]}' file
Explanation:
-F, tells awk that input fields are separated by ','
split($2,d,"-") splits the second column (the date) by '-'
and puts the bits into the array 'd'
print > d[1] prints the whole input line into a file named after the year
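With many distinct values, some awk implementations may run into the limit on simultaneously open files, because every print > name keeps its output file open. The following is a minimal sketch of one way around that, not part of the original answer: append to each file and close it straight away. The scratch directory, the input.csv name and the .csv suffix are assumptions made for illustration.

```shell
# Work in a scratch directory so that nothing existing is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample input in the layout from the question (input.csv is an illustrative name).
cat > input.csv <<'EOF'
1012739937,2006-11-28,d_02245211
1012739937,2006-11-28,d_02238545
1012740478,1998-06-26,d_01913689
1012740478,1998-06-26,i_4869
EOF

# Append each record to <year>.csv, then close that file so that only one
# output file is open at any time. With >> a rerun would append again,
# so start from an empty directory.
awk -F, '{ split($2, d, "-"); out = d[1] ".csv"; print >> out; close(out) }' input.csv

ls
```

Closing after every line costs speed. If the input is already grouped by the key, closing the file only when the key changes is cheaper.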

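A quick awk solution, if slightly fragile (it assumes the second column, if it exists, always starts with yyyy):
awk -F, '$2{print > (substr($2,1,4) ".csv")}' test.in
It will split the input into files named yyyy.csv; make sure they don't exist in your current directory or they will be overwritten. Note that awk string positions start at 1; with a start of 0, some awk implementations return only the first three characters.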

A different awk take: use a slightly more complicated field separator:
awk -F '[,-]' '{print > $2}' file

Related

linux shell script delimiter

How to change delimiter from current comma (,) to semicolon (;) inside .txt file using linux command?
Here is my ME_1384_DataWarehouse_*.txt file:
Data Warehouse,ME_1384,Budget for HW/SVC,13/05/2022,10,9999,13/05/2022,27,08,27,08
Data Warehouse,ME_1384,Budget for HW/SVC,09/05/2022,10,9999,09/05/2022,45,58,45,58
Data Warehouse,ME_1384,Budget for HW/SVC,25/05/2022,10,9999,25/05/2022,7,54,7,54
Data Warehouse,ME_1384,Budget for HW/SVC,25/05/2022,10,9999,25/05/2022,7,54,7,54
It is very important that the values of the last two columns stay numbers with 2 decimal places, so each of the last 2 columns in the first row, for example, has the value "27,08".
That could be the main reason why the delimiter can't be changed properly.
I tried with:
sed 's/,/;/g' ME_1384_DataWarehouse_*.txt
and every comma was changed, including the ones inside the values of the last 2 columns.
Is there anyone who can help me out with this issue?
With sed you can replace the nth occurrence of a certain lookup string. Example:
$ sed 's/,/;/4' file
will replace the 4th comma with a semicolon.
So, if you know you have 11 fields (10 commas), you can do
$ sed 's/,/;/g;s/;/,/10;s/;/,/8' file
Example:
$ seq 1 11 | paste -sd, | sed 's/,/;/g;s/;/,/10;s/;/,/8'
1;2;3;4;5;6;7;8,9;10,11
Your question is somewhat unclear, but if you are trying to say "don't change the last comma, or the third-to-last one", a solution to that might be
perl -pi~ -e 's/,(?![^,]+(?:,[^,]+,[^,]+)?$)/;/g' ME_1384_DataWarehouse_*.txt
Perl in isolation does not perform any loop over the input lines, but the -p option says to loop over input one line at a time, like sed, and print every line (there is also -n to simulate the behavior of sed -n); the -i~ says to modify the file, but save the original with a tilde added to its file name as a backup; and the regex uses a negative lookahead (?!...) to protect the two fields you want to exempt from the replacement. Lookaheads are a modern regex feature which isn't supported by older tools like sed.
Once you are satisfied with the solution, you can remove the ~ after -i to disable the generation of backups.
You can do this with awk:
awk -F, 'BEGIN {OFS=";"} {a=$NF;NF-=1; printf "%s,%s\n",$0,a} ' input_file
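This should work with most awk versions (do not count on the standard Solaris awk).
The idea is to store the last field of the row in a variable, decrease the number of fields, and then print the row with the new delimiter, followed by a comma and the stored last field. Note that this keeps only the final comma; the comma inside the second-to-last value (27,08) is still converted to a semicolon.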
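The awk command above protects only the last comma. A hedged variant that also keeps the third-to-last comma, so that both decimal values survive, could rebuild the line field by field. It assumes every row ends in exactly two comma-decimal values, as in the question's sample; the variable names are illustrative.

```shell
# One sample row from the question.
line='Data Warehouse,ME_1384,Budget for HW/SVC,13/05/2022,10,9999,13/05/2022,27,08,27,08'

# The separator in front of field i is comma number i-1. Keep it as a comma in
# front of the last field and in front of field NF-2 (the third-to-last comma);
# every other separator becomes a semicolon.
result=$(printf '%s\n' "$line" | awk -F, '{
    out = $1
    for (i = 2; i <= NF; i++)
        out = out ((i == NF || i == NF - 2) ? "," : ";") $i
    print out
}')
printf '%s\n' "$result"
```

For the sample row this prints Data Warehouse;ME_1384;Budget for HW/SVC;13/05/2022;10;9999;13/05/2022;27,08;27,08.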

Combine first two columns of a single csv file into another column

So I have a large CSV file (in Gb) where I have multiple columns, the first two columns are :
Invoice number|Line Item Number
I want a unix / linux / ubuntu command which can merge these two columns and create a new column in which the values are separated by ':'. For example, if the invoice number is 64789544 and the line item number is 234533, then my merged value should be
64789544:234533
Can this be achieved? If yes, can the merged column be added back to the source CSV file?
You can use the following sed command:
$ cat large.csv
Invoice number|Line Item Number|Other1|Other2
64789544|234533|abc|134
64744123|232523|cde|awc
$ sed -i.bak 's/^\([^|]*\)|\([^|]*\)/\1:\2/' large.csv
$ cat large.csv
Invoice number:Line Item Number|Other1|Other2
64789544:234533|abc|134
64744123:232523|cde|awc
Be aware that -i.bak keeps a backup copy of your input file, so you need enough free space on your file system.
Explanation:
s/^\([^|]*\)|\([^|]*\)/\1:\2/ matches the first two |-separated fields of each line and, using back references, replaces the separator between them with :, which merges the two columns.
If you are sure about what you are doing, you can change -i.bak to -i to avoid keeping a backup of the CSV file.
Perhaps with this simple sed
sed 's/|/:/' infile
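Both sed commands replace the first two columns with the merged one. If the merged value should instead be added as a new column while the originals are kept, an awk sketch along these lines may help. The "Merged" header name is made up for illustration, and the sample rows are the ones from the answer above.

```shell
result=$(printf '%s\n' \
    'Invoice number|Line Item Number|Other1|Other2' \
    '64789544|234533|abc|134' \
    '64744123|232523|cde|awc' |
awk -F'|' -v OFS='|' '
    NR == 1 { print $0, "Merged"; next }   # "Merged" is a made-up header name
    { print $0, $1 ":" $2 }                # append invoice:line as a new last column
')
printf '%s\n' "$result"
```

For a real file, write the output to a new file and rename it once the result looks right, since awk does not edit in place by default.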

use uniq -d on a particular column?

Have a text file like this.
john,3
albert,4
tom,3
junior,5
max,6
tony,5
I'm trying to fetch records where column2 value is same. My desired output.
john,3
tom,3
junior,5
tony,5
Can we use uniq -d on the second column for this?
Here's one way using awk. It reads the input file twice, but avoids the need to sort:
awk -F, 'FNR==NR { a[$2]++; next } a[$2] > 1' file file
Results:
john,3
tom,3
junior,5
tony,5
Brief explanation:
FNR==NR is a common awk idiom that is true only while the first file in the argument list is being read. During that pass, the value of column two is used as an array key and its count is incremented (the next keyword skips the rest of the code). On the second read of the file, we simply print the lines whose column-two value was counted more than once.
You can use uniq on fields (columns), but not easily in your case.
Uniq's -f and -s options skip a number of leading fields and characters respectively before comparing. However, neither of these quite does what you want.
-f divides fields by whitespace and you separate them with commas.
-s skips a fixed number of characters and your names are of variable length.
Overall though, uniq is used to compress input by consolidating duplicates into unique lines. You are actually wishing to retain duplicates and eliminate singletons, which is the opposite of what uniq is used to do. It would appear you need a different approach.
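One such approach, sketched here for the case where the input arrives on a pipe and cannot be read twice, is a single awk pass that buffers the lines and replays them at the end. It holds the whole input in memory, so it is an assumption that the data fits.

```shell
result=$(printf '%s\n' john,3 albert,4 tom,3 junior,5 max,6 tony,5 |
awk -F, '
    { count[$2]++; line[NR] = $0; key[NR] = $2 }   # remember every line and its key
    END {
        for (i = 1; i <= NR; i++)                  # replay in input order
            if (count[key[i]] > 1) print line[i]
    }
')
printf '%s\n' "$result"
```

The lines whose second column occurs more than once are printed in their original order.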

Working with complex CSV from Linux command line

I have a complex CSV file (here is external link because even a small part of it wouldn't look nice on SO) where a particular column may be composed of several columns separated by space.
reset,angle,sine,multiStepPredictions.actual,multiStepPredictions.1,anomalyScore,multiStepBestPredictions.actual,multiStepBestPredictions.1,anomalyLabel,multiStepBestPredictions:multiStep:errorMetric='altMAPE':steps=[1]:window=1000:field=sine,multiStepBestPredictions:multiStep:errorMetric='aae':steps=[1]:window=1000:field=sine
int,string,string,string,string,string,string,string,string,float,float
R,,,,,,,,,,
0,0.0,0.0,0.0,None,1.0,0.0,None,[],0,0
0,0.0314159265359,0.0314107590781,0.0314107590781,{0.0: 1.0},1.0,0.0314107590781,0.0,[],100.0,0.0314107590781
0,0.0628318530718,0.0627905195293,0.0627905195293,{0.0: 0.0039840637450199202 0.03141075907812829: 0.99601593625497931},1.0,0.0627905195293,0.0314107590781,[],66.6556977331,0.0313952597647
0,0.0942477796077,0.0941083133185,0.0941083133185,{0.03141075907812829: 1.0},1.0,0.0941083133185,0.0314107590781,[],66.63923621,0.0418293579232
0,0.125663706144,0.125333233564,0.125333233564,{0.06279051952931337: 0.98942669172932329 0.03141075907812829: 0.010573308270676691},1.0,0.125333233564,0.0627905195293,[],59.9506102238,0.0470076969512
0,0.157079632679,0.15643446504,0.15643446504,{0.03141075907812829: 0.0040463956041429626 0.09410831331851431: 0.94917381047888194 0.06279051952931337: 0.046779793916975114},1.0,0.15643446504,0.0941083133185,[],53.2586756624,0.0500713879053
0,0.188495559215,0.187381314586,0.187381314586,{0.12533323356430426: 0.85789473684210527 0.09410831331851431: 0.14210526315789476},1.0,0.187381314586,0.125333233564,[],47.5170631454,0.0520675034246
For viewing I am using this trick: column -s,$'\t' -t < *.csv | less -#2 -N -S which is an upgraded version borrowed from Command line CSV viewer. With this trick it is explicitly clear which is the 1st, 2nd, 3rd ... column and which data consist of several space-separated values inside a particular column.
My question is whether there is any trick for manipulating such a complex CSV. I know that I can use awk to filter the 5th column and then filter the 2nd sub-column out of that to get the desired portion of the complex data, but I would need to watch whether there was another composed column before the 5th (so that I would actually need the 6th column, not the 5th, etc.), and some columns may also contain a mix of composed and non-composed data. So awk is probably not the right tool.
The CSV viewer link mentions a tool called csvlook which adds pipes to the output as a separator. This could be easier to filter, because pipes would delimit columns and white space would delimit the composed data within one column. But I cannot run csvlook with multiple delimiters (comma and tab) as I did for column, so it did not generate the data properly. What is the most comfortable way of handling this?
As long as your input doesn't contain columns with escaped embedded , chars., you should be able to parse it with awk, using , as the field separator; e.g.:
awk -F, '{ n = split($5, subField, "[[:blank:]]+"); for (i=1;i<=n;++i) print subField[i] }' file.csv
The above splits the 5th field into sub-fields by whitespace, using the split() function.
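Building on the same split() idea, the sub-fields of the 5th column can be paired up as key and value. This is only a sketch under the assumption that the column always looks like {key: value key: value ...}, as in the sample rows; it uses [ \t]+ instead of [[:blank:]]+ so that older awks without POSIX character classes also cope.

```shell
result=$(printf '%s\n' '0,0.0628318530718,0.0627905195293,0.0627905195293,{0.0: 0.0039840637450199202 0.03141075907812829: 0.99601593625497931},1.0,0.0627905195293,0.0314107590781,[],66.6556977331,0.0313952597647' |
awk -F, 'index($5, "{") == 1 {
    field = $5
    gsub(/[{}]/, "", field)              # drop the surrounding braces
    n = split(field, part, /[ \t]+/)     # key: value key: value ...
    for (i = 1; i < n; i += 2) {
        key = part[i]
        sub(/:$/, "", key)               # strip the trailing colon from the key
        print key "\t" part[i + 1]
    }
}')
printf '%s\n' "$result"
```

Each key and its value end up on one tab-separated line, which is easier to feed into further filters than the raw composed column.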
Take a look at the cut command. You can specify a list of fields or a range of fields.

How to get CSV dimensions from terminal

Suppose I'm in a folder where ls returns Test.csv. What command do I enter to get the number of rows and columns of Test.csv (a standard comma separated file)?
Try using awk. It's best suited for well formatted csv file manipulations.
awk -F, 'END {printf "Number of Rows : %s\nNumber of Columns = %s\n", NR, NF}' Test.csv
-F, specifies , as a field separator in csv file.
At the end of file traversal, NR holds the number of rows and NF the number of fields in the last row, which equals the column count when every row has the same number of fields.
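Since NF in the END block reflects only the last row, it can be worth checking that all rows agree. A small sketch that tallies rows per field count, using made-up sample data with one short row:

```shell
result=$(printf '%s\n' 'a,b,c' '1,2,3' '4,5' |
awk -F, '
    { rows[NF]++ }                       # tally rows by their number of fields
    END { for (nf in rows) print nf " columns: " rows[nf] " row(s)" }
' | sort -n)
printf '%s\n' "$result"
```

A well-formed file produces a single line; several lines mean that some rows have a different number of fields. Like the command above, this treats every comma as a separator, so quoted fields containing commas are miscounted.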
Another quick and dirty approach would be:
# Number of rows
wc -l < Test.csv
# Number of columns (one header field per line, so names with spaces still count once)
head -n 1 Test.csv | tr ',' '\n' | wc -l
Although it is not a native solution using GNU coreutils, it is worth mentioning (since this is one of the top Google results for this question) that xsv provides a command to list the headers of a CSV file; counting them gives the number of columns.
# count rows
xsv count <filename>
# count columns
xsv headers <filename> | wc -l
For big files this is orders of magnitude faster than native solutions with awk and sed.
