pasting files side by side - python-3.x

I have many ASCII files in a directory. I just want to sort the file names numerically and paste the files side by side.
Secondly, after pasting I want to make all the columns the same length by appending zeros at the end.
My files are named as follows (shown side by side with their contents, one column per file):
data_Z_1  data_N_457  data_E_45
1.5       1.2         2.3
2.0       2.3         1.8
          4.5
First, I just want to sort the above file names numerically and then paste the files side by side, as given below:
data_Z_1  data_E_45  data_N_457
1.5       2.3        1.2
2.0       1.8        2.3
                     4.5
Secondly, I need to make all the columns equal length in the pasted file, so that the output looks like:
1.5 2.3 1.2
2.0 1.8 2.3
0.0 0.0 4.5
I tried as below:
ls data_*_* | sort -V
But it does not work. Can anybody help me overcome this problem? Thanks in advance.

Would you please try the following:
paste $(ls data* | sort -t_ -k3n) | awk -F'\t' -v OFS='\t' '
{for (i=1; i<=NF; i++) if ($i == "") $i = "0.0"} 1'
Output:
1.5 2.3 1.2
2.0 1.8 2.3
0.0 0.0 4.5
sort -t_ -k3n sets the field separator to _ and numerically sorts
the filenames on the values of the 3rd field.
The options -F'\t' -v OFS='\t' to the awk command set the
input and output field separators to a tab character.
The awk statement for (i=1; i<=NF; i++) if ($i == "") $i = "0.0"
scans the input fields and sets 0.0 for the empty fields.
The final 1 is equivalent to print $0 to print the fields.
[Edit]
If you have a huge number of files, the expanded command line may become too long for the shell. Here is an alternative in Python using pandas DataFrames.
#!/usr/bin/python
import glob
import pandas as pd
import re

files = glob.glob('data*')
files.sort(key=lambda x: int(re.sub(r'.*_', '', x)))  # sort filenames numerically by their trailing number
dfs = []                                              # list of dataframes
for f in files:
    df = pd.read_csv(f, header=None, names=[f])       # read the file and name its column after the file
    df = df.apply(pd.to_numeric, errors='coerce')     # force the cell values to floats
    dfs.append(df)                                    # add as a new column
df = pd.concat(dfs, axis=1, join='outer')             # create a dataframe from the list of dataframes
df = df.fillna(0)                                     # fill empty cells with 0
print(df.to_string(index=False, header=False))        # print the dataframe without index and header
which will produce the same results.
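If you need the padded result in a file rather than on standard output, the same DataFrame can be written out directly; for example (a sketch, assuming a hypothetical output name merged.txt and tab-separated columns):
df.to_csv('merged.txt', sep='\t', index=False, header=False, float_format='%.1f')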

Related

Filtering on a condition using the column names and not numbers

I am trying to filter a text file with columns based on two conditions. Due to the size of the file, I cannot use the column numbers (there are thousands of columns and they are unnumbered), so I need to use the column names. I have searched and tried to come up with multiple ways to do this, but nothing is returned to the command line.
Here are a few things I have tried:
awk '($colname1==2 && $colname2==1) { count++ } END { print count }' file.txt
to filter out the columns based on their conditions
and
head -1 file.txt | tr '\t' | cat -n | grep "COLNAME
to try and return the possible column number related to the column.
An example file would be:
ID ad bd
1 a fire
2 b air
3 c water
4 c water
5 d water
6 c earth
Output would be:
2 (count of ad=c and bd=water)
with your input file and the implied conditions this should work
$ awk -v c1='ad' -v c2='bd' 'NR==1{n=split($0,h); for(i=1;i<=n;i++) col[h[i]]=i}
$col[c1]=="c" && $col[c2]=="water"{count++} END{print count+0}' file
2
or you can replace c1 and c2 with the values in the script as well.
to find the column indices you can run
$ awk -v cols='ad bd' 'BEGIN{n=split(cols,c); for(i=1;i<=n;i++) colmap[c[i]]}
NR==1{for(i=1;i<=NF;i++) if($i in colmap) print $i,i; exit}' file
ad 2
bd 3
or perhaps with this chain
$ sed 1q file | tr -s ' ' \\n | nl | grep -E 'ad|bd'
2 ad
3 bd
although it may have false positives due to the regex match...
You can rewrite the awk to be more succinct
$ awk -v cols='ad bd' '{while(++i<=NF) if(FS cols FS ~ FS $i FS) print $i,i;
exit}' file
ad 2
bd 3
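The same index lookup can be sketched in Python as well (a minimal sketch, assuming a whitespace-separated header line and the file name file.txt):
with open('file.txt') as fh:
    header = fh.readline().split()

# build a name -> 1-based column index map, similar to the awk col[] array above
col = {name: i + 1 for i, name in enumerate(header)}
print(col['ad'], col['bd'])   # prints: 2 3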
As I mentioned in an earlier comment, the answer at https://unix.stackexchange.com/a/359699/133219 shows how to do this:
awk -F'\t' '
NR==1 {
    for (i=1; i<=NF; i++) {
        f[$i] = i
    }
}
($(f["ad"]) == "c") && ($(f["bd"]) == "water") { cnt++ }
END { print cnt+0 }
' file
2
I'm assuming your input is tab-separated because of the tr '\t' in the command in your question, which looks like an attempt to convert tabs to newlines so you can map column names to numbers. If I'm wrong and the fields are just separated by runs of white space, then remove -F'\t' from the above.
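For comparison, the same count can also be obtained by column name with pandas; a minimal sketch, assuming whitespace-separated fields and the file name file.txt (use sep='\t' if the data really is tab-separated):
import pandas as pd

df = pd.read_csv('file.txt', sep=r'\s+')                    # first line becomes the header
print(((df['ad'] == 'c') & (df['bd'] == 'water')).sum())    # prints: 2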
Use the miller toolkit to manipulate tab-delimited files using column names. Below is a one-liner that filters a tab-delimited file (the delimiter is specified with --tsv) and writes the results, together with the header, to STDOUT. The header is then removed using tail and the lines are counted with wc.
mlr --tsv filter '$ad == "c" && $bd == "water"' file.txt | tail -n +2 | wc -l
Prints:
2
SEE ALSO:
miller manual
Note that miller can be easily installed, for example, using conda, like so:
conda create --name miller miller
For years it bugged me that there is no succinct way in Unix to do this sort of thing, although miller is a pretty good tool for it. Recently I wrote pick to choose columns by name, and additionally to modify, combine and add them by name, as well as to filter rows by clauses using column names. The solution to the above with pick is
pick -h #ad=c #bd=water < data.txt | wc -l
By default pick prints the header of the selected columns, -h is to omit it. To print columns you simply name them on the command line, e.g.
pick ad water < data.txt | wc -l
Pick has many modes, all of them focused on manipulating columns and selecting/filtering rows with a minimal amount of syntax.

check if any value of a column in csv file exists in a column in a second csv file

I want to compare columns in two CSV files: basically, check whether any value in one column exists in a column of the other file. If such values exist, print them out.
Ex:
file1:
id,value
abc,789
efg,766
hij,456
file2:
id,value
klm,789
nop,766
abc,456
I need to check whether any values in file2's 'id' column exist in file1's 'id' column. In the example above, 'abc' is a value that appears in both files and needs to be printed out.
Is there a bash script that can do this?
Using awk:
awk -F, 'FNR==1 { next } NR==FNR { map[$1]=$2;next } map[$1]!="" { print;print $1"\t"map[$1] } ' file1 file2
If the line number within the file is 1 (FNR==1), skip to the next line; this skips the header of each file. When processing the first file (NR==FNR), create an array map with the first comma-separated field as the index and the second field as the value. Then, when processing the second file, if there is an entry in map for the first field, print the line along with the entry from the map array.
If you are using Python, you can use the pandas library.
import pandas as pd
df = pd.DataFrame(yourdata, columns = ['X', 'Y', 'Z'])
duplicate = df[df.duplicated()]
print(duplicate)
For more detailed info, you can check this page.
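The cross-file check from the question can also be done with plain Python sets; a minimal sketch, assuming comma-separated files with a header row named file1.csv and file2.csv:
def id_column(path):
    with open(path) as fh:
        next(fh)                                   # skip the header row
        return {line.split(',')[0].strip() for line in fh}

common = id_column('file1.csv') & id_column('file2.csv')
print('\n'.join(sorted(common)))                   # prints: abc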
Using join (and tail and sort and Bash's process substitution):
$ join -t, -j 1 -o "1.1" <(tail -n +2 file1 | sort) <(tail -n +2 file2 | sort)
abc
Explained:
join -t, -j 1 -o "1.1" join on the first comma-separated field, output the first field of the first file
<(...) Bash's process substitution
tail -n +2 file1 ditch the header
| sort join expects the files to be sorted
(Yeah, I'd use #RamanSailopal's awk solution, too, ++)

splitting the file based on repetition

Experts, I have a file as below, where the first column repeats the values 0.0, 5.0, 10.0. Now I want to split the third column at each repetition of the first-column values and arrange the data side by side, as below:
0.0 0.0 0.50000E+00
5.0 0.0 0.80000E+00
10.0 0.0 0.80000E+00
0.0 1.0 0.10000E+00
5.0 1.0 0.90000E+00
10.0 1.0 0.30000E+00
0.0 2.0 0.90000E+00
5.0 2.0 0.50000E+00
10.0 2.0 0.60000E+00
so that my final file will be
0.50000E+00 0.10000E+00 0.90000E+00
0.80000E+00 0.90000E+00 0.50000E+00
0.80000E+00 0.30000E+00 0.60000E+00
Using GNU awk:
awk '{ map[$1][NR]=$3 } END { PROCINFO["sorted_in"]="#ind_num_asc";for(i in map) { for ( j in map[i]) { printf "%s\t",map[i][j] } printf "\n" } }' file
Process each line and add it to a two-dimensional array map, with the first space-delimited field as the first index and the line number as the second. The third field is the value. At the end of the file, set the array traversal order and then loop through the array, printing the values in the required format.
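The same reshaping can also be sketched in Python without gawk's multidimensional arrays (assuming the input file is named file and is whitespace-separated):
from collections import defaultdict

cols = defaultdict(list)                 # first-column value -> third-column values in file order
with open('file') as fh:
    for line in fh:
        first, _, third = line.split()
        cols[first].append(third)

for key in sorted(cols, key=float):      # 0.0, 5.0, 10.0
    print('\t'.join(cols[key]))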

combine two csv files based on common column using awk or sed [duplicate]

This question already has answers here:
How to merge two files using AWK? [duplicate]
(4 answers)
Closed 2 years ago.
I have two CSV files that share a common column; one of the files also contains duplicates. How can I merge both CSV files using awk or sed?
CSV file 1
5/1/20,user,mark,Type1 445566
5/2/20,user,ally,Type1 445577
5/1/20,user,joe,Type1 445588
5/2/20,user,chris,Type1 445566
CSV file 2
Type1 445566,Name XYZ11
Type1 445577,Name AAA22
Type1 445588,Name BBB33
Type1 445566,Name XYZ11
What I want is?
5/1/20,user,mark,Type1 445566,Name XYZ11
5/2/20,user,ally,Type1 445577,Name AAA22
5/1/20,user,joe,Type1 445588,Name BBB33
5/2/20,user,chris,Type1 445566,Name XYZ11
So is there a bash command in Linux/Unix to achieve this? Can we do this using awk or sed?
Basically, I need to match column 4 of CSV file 1 with column 1 of CSV file 2 and merge both csv's.
Tried the following command:
paste -d, <(cut -d, -f 1-2 ./test1.csv | sed 's/$/,Type1/') test2.csv
Got Result:
5/1/20,user,Type1,Type1 445566,Name XYZ11
If you are able to install the join utility, this command works:
join -t, -o 1.1 1.2 1.3 2.1 2.2 -1 4 -2 1 file1.csv file2.csv
Explanation:
-t, identify the field separator as comma (',')
-o 1.1 1.2 1.3 2.1 2.2 format the output to be "file1col1, file1col2, file1col3, file2col1, file2col2"
-1 4 join by column 4 in file1
-2 1 join by column 1 in file2
For additional usage information for join, reference the join manpage.
Edit: You specifically asked for the solution using awk or sed so here is the awk implementation:
awk -F"," 'NR==FNR {a[$1] = $2; next} {print $1","$2","$3","$4"," a[$4]}' \
file2.csv \
file1.csv
Explanation:
-F"," Delimit by the comma character
NR==FNR Read the first file argument (notice in the above solution that we're passing file2 first)
{a[$1] = $2; next} In the current file, save the contents of Column2 in an array that uses Column1 as the key
{print $1","$2","$3","$4"," a[$4]} Read file1 and using Column4, match the value to the key's value from the array. Print Column1, Column2, Column3, Column4, and the key's value.
The two example input files seem to be already appropriately sorted, so you just have to put them side by side, and paste is good for this; however you want to remove some ,-separated columns from file1, and you can use cut for that; but you also want to insert another (constant) column, and sed can do it. A possible command is this:
paste -d, <(cut -d, -f 1-2 file1 | sed 's/$/,abcd/') file2
Actually sed can do the whole processing of file1, and the output can be piped into paste, which uses - to capture it from the standard input:
sed -E 's/^(([^,]+,){2}).*/\1abcd/' file1 | paste -d, - file2

How to use awk to get the result of computation of column1 value of the same column2 value in 2 csv files in Ubuntu?

I am using Ubuntu and have a CSV file, file1.csv, with 2 columns that looks like
a,1
b,2
c,3
...
and another file, file2.csv, with 2 columns that looks like
a,4
b,3
d,2
...
Some column-1 values appear in file1.csv but not in file2.csv, and vice versa; these values should not be in result.csv. Say the second-column value in file1.csv is x and the second-column value in file2.csv for the same first-column value is y. How can I use awk to compute (x-y)/(x+y) from the second columns of the 2 CSV files in Ubuntu, producing a result.csv like this:
a,-0.6
b,-0.2
-0.6 is computed by (1-4)/(1+4)
-0.2 is computed by (2-3)/(2+3)
What about this?
$ awk 'BEGIN{FS=OFS=","} FNR==NR {a[$1]=$2; next} {if ($1 in a) print $1,(a[$1]-$2)/(a[$1]+$2)}' f1 f2
a,-0.6
b,-0.2
Explanation
BEGIN{FS=OFS=","} set input and output field separators as comma.
FNR==NR {a[$1]=$2; next} when processing the first file, store in the array a[] values of the form a[first col]=second col.
{if ($1 in a) print $1,(a[$1]-$2)/(a[$1]+$2)} when looping through the second file, on each line: check whether the first column is stored in the a[] array; if so, print (x-y)/(x+y), where x is the stored value and y is the current second column.
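A plain-Python version of the same logic, for reference (a sketch, assuming the two files are named f1 and f2 as in the command above):
x = {}
with open('f1') as fh:                     # first file: remember each key's x value
    for line in fh:
        key, val = line.strip().split(',')
        x[key] = float(val)

with open('f2') as fh:                     # second file: compute (x-y)/(x+y) for shared keys
    for line in fh:
        key, val = line.strip().split(',')
        if key in x:
            y = float(val)
            print(f"{key},{(x[key] - y) / (x[key] + y)}")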
