Cleaning up a CSV (artifacts and missing spacing) - Linux

My data was in this form. You can see that the 3rd column (2nd if you count from 0) touches the one before it when its values rise to the next order of magnitude, and that the last column contains artifacts from non-data input being recorded.
17:10:39 2.039 26.84 4.6371E-9 -0.7$R200$O100
17:10:41 2.082 27.04 4.6334E-9 -0.4
17:10:43 1.980 26.97 4.6461E-9 0.3
17:10:45 2.031 26.87 4.6502E-9 1.0$R200
17:10:47 2.090 27.09 4.6296E-9 0.1
...
18:49:40 1.930226.34 2.8246E-5 7.1
18:49:42 2.031226.04 2.8264E-5 8.2
Now I did fix this all by hand, adding a "|" delimiter instead of " " and cutting away the few artifacts, but it was a pain.
So, in the prospect of getting even larger data sets from the same machine in the future: are there any tips on how to write a script in Python, or are there Linux-based tools out there already, to fix this CSV / make a new, fixed CSV out of it?

In the Linux shell:
cut -c 1-14 data.csv > DataA
cut -c 15-49 data.csv > DataB
paste DataA DataB | tr -s " " "\t" > DataC
This cuts the CSV into two parts, splitting exactly where the columns touch; the second cut also drops the unwanted artifacts at the end of the line.
We then paste the parts back together and squeeze the spaces into tabs (paste itself adds a tab between the parts).
Now, in case we'd want to stick with the "|" delimiter, the next step could be:
tr -s "\t" "|" < DataC > DataFinal
rm DataA DataB DataC
But this is purely optional.
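If you'd rather skip the temporary files, the same pipeline can be written as one line using bash process substitution (same column positions as above):
paste <(cut -c 1-14 data.csv) <(cut -c 15-49 data.csv) | tr -s " " "\t" > DataFinal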

The data you are showing is not CSV (or DSV) but plain-text data with fixed field widths. Trying to read it as CSV will be error-prone.
Instead, this data should be processed as fixed-width, with the following field widths:
8 / 6 / 6 / 10 (or 11) / 8 (or 7) / rest of line
See this question on how to parse fixed width fields in Python.
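If staying in the shell is also an option, GNU awk can split fixed-width records directly with its FIELDWIDTHS extension. A minimal sketch using the widths suggested above (the artifact-stripping sub() and the "|" output delimiter are assumptions on my part, not part of the original data spec):
gawk 'BEGIN { FIELDWIDTHS = "8 6 6 10 8"; OFS = "|" }
      {
        sub(/\$.*/, "", $5)                         # drop trailing $R.../$O... artifacts
        for (i = 1; i <= 5; i++) gsub(/ /, "", $i)  # strip the padding spaces
        print $1, $2, $3, $4, $5
      }' data.csv > DataFinal
Anything beyond the five declared widths (the "rest of line") is simply never assigned to a field, which is exactly where the recording artifacts live.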

Related

Lifting over a GWAS summary statistic file from build 38 to build 37

I am using the UCSC liftOver tool and the associated chain file to lift over the results of my GWAS summary statistic file (a tab-separated file) from build 38 to build 37. The GWAS summary stat file looks like:
1 chr1_17626_G_A 17626 A G 0.016 -0.0332 0.0237 0.161
1 chr_20184_G_A 20184 A G 0.113 -0.185 0.023 0.419
Following is the UCSC tool and the associated chain file I am using:
liftover: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
chain file: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
I want to create a file in BED format from the GWAS summary stat file, as that is the input required by the tool. I would like the first three columns to be tab-separated and the rest of the columns to be merged into a single column, separated by a non-tab separator such as ":", so as to preserve them while running the liftover. The first three columns of the input BED file would be:
awk 'BEGIN{OFS="\t"} {print "chr"$1, $3-1, $3}' GWAS_summary_stat_file > ucsc.input.file
#$1 = chrx, where x is the chromosome number
#$2 = position - 1 for SNPs
#$3 = bp position (hg38) for SNPs
The above three are the columns required by the tool.
My questions are:
How can I use a non-tab separator, say ":", to merge the rest of the columns of the GWAS summary stat file into one column?
After running the liftover, how can I unpack the columns separated by ":"?
I am not sure if this answers your questions, but please take a look.
You can use awk to merge multiple columns with ":":
awk '{print $1 ":" $2 ":" $3}' file
and then, say you want to replace ":" with tabs in $1, you can do
awk '{gsub(/:/,"\t",$1)}1' file
Is this of any help?
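Putting both steps together for the liftOver round trip (a sketch; gwas.txt, ucsc.input.bed and lifted.bed are placeholder names, and I'm assuming the 9-column layout shown in the question). Build the BED file with the first three columns tab-separated and everything else packed into a ":"-joined fourth column:
awk 'BEGIN{OFS="\t"} {extra=$2; for(i=4;i<=NF;i++) extra=extra":"$i; print "chr"$1, $3-1, $3, extra}' gwas.txt > ucsc.input.bed
After running liftOver, unpack column 4 back into tab-separated columns:
awk 'BEGIN{OFS="\t"} {gsub(/:/,"\t",$4)}1' lifted.bed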

I am trying to add multiple users from a CSV in Linux using CentOS [duplicate]

I have
while read field1 field2 field3 field4
do
trimmed=$(echo "$field2" | sed 's/ *$//g')
echo "$trimmed","$field3" >> new.csv
done < "$FEEDS"/"$DLFILE"
Now the problem is with read: I can't make it split fields CSV-style, can I? See the input CSV format below.
I need to get columns 3 and 4 out, stripping the padding from column 2, and I don't need the quotes.
CSV format with column positions:
1(") 2-23(Field1) 24(") 25(,) 26(") 27-41(Field2values) 42(") 43(,) 44-(Field3 decimal values)
"Field1_constant_value","Field2values ",Field3,Field4
Field1 is constant and irrelevant. It is quoted, and the data runs from columns 2-23 inside the quotes.
Field2 is fixed-width, columns 27-41 inside quotes, with the data at the left and space padding on the right.
Field3 is a decimal number with 1, 2, or 3 digits before the decimal point and 2 after, no padding. It starts at column 44.
Field4 is a date, and I don't much care about it right now.
Yes, you can use read; all you've got to do is set the shell variable IFS (the Internal Field Separator) so that it splits lines not by its default value (whitespace) but by your own delimiter.
Considering an input file "a.csv", with the given contents:
1,2,3,4
2,3,4,5
6,3,2,1
You can do this:
IFS=','
while read f1 f2 f3 f4; do
echo "fields[$f1 $f2 $f3 $f4]"
done < a.csv
And the output is:
fields[1 2 3 4]
fields[2 3 4 5]
fields[6 3 2 1]
A good starting point for you is here: http://backreference.org/2010/04/17/csv-parsing-with-awk/
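For the specific format in the question, a pure-bash sketch (assuming one record per line, no embedded commas, and $FEEDS/$DLFILE as in your script; setting IFS only on the read line keeps the rest of the shell unaffected):
while IFS=, read -r f1 f2 f3 f4; do
    f2=${f2#\"}; f2=${f2%\"}      # drop the surrounding quotes
    f2=${f2%"${f2##*[! ]}"}       # strip the trailing space padding
    echo "$f2,$f3"
done < "$FEEDS/$DLFILE" > new.csv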

Linux: filtering a file by two columns and printing the output

I have a table that has 9 columns.
How would I first filter on the strand column, so that only rows with a "+" are selected, and then, of those, select the ones that have 3 exons (in the exon count column)?
I have been trying to use grep for this, as I understand it can pick out a word from a column, but I only get that particular column back, or just the total count.
Using awk:
awk -F "," '$4=="+" && $9=="3"' file.csv
If it's not CSV, then remove -F "," from this command.
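A quick check with a made-up comma-separated table, file.csv (assuming, as in the question, strand is column 4 and exon count is column 9):
g1,1,100,+,a,b,c,d,3
g2,1,200,-,a,b,c,d,3
g3,1,300,+,a,b,c,d,2
Running awk -F "," '$4=="+" && $9=="3"' file.csv prints only the first row.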

Script to compare two large 900 x 900 comma-delimited files

I have tried awk, but haven't been able to perform a diff for every cell, one at a time, on both files.
If you just want a rough answer, possibly the simplest thing is to do something like:
tr , \\n < file1 > /tmp/output
tr , \\n < file2 | diff - /tmp/output
That will convert each file to one column and run diff. You can compute the cells that differ from the line numbers in the output: with 900 columns, line n corresponds to row int((n-1)/900)+1, column ((n-1) mod 900)+1.
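If you want the row and column of every differing cell reported directly, an awk sketch (assuming plain commas with no quoting; file1 and file2 are the two inputs):
awk -F, 'NR==FNR { for (i=1; i<=NF; i++) a[FNR,i]=$i; next }
         { for (i=1; i<=NF; i++) if (a[FNR,i] != $i) print "row " FNR ", col " i ": " a[FNR,i] " -> " $i }' file1 file2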
The simplest way with awk, without accounting for newlines inside fields, quoted commas, etc. These one-liners need GNU awk (RS is set to a regular expression), and file{,2} is shell brace expansion for file file2.
Print the cells that are the same:
awk 'BEGIN{RS=",|"RS}a[FNR]==$0;{a[NR]=$0}' file{,2}
Print the cells that differ:
awk 'BEGIN{RS=",|"RS}FNR!=NR&&a[FNR]!=$0;{a[NR]=$0}' file{,2}
Print, for every cell, whether it is the same or not:
awk 'BEGIN{RS=",|"RS}FNR!=NR{print "cell"FNR (a[FNR]==$0?"":" not")" the same"}{a[NR]=$0}' file{,2}
Input
file
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
file2
1,2,3,4,5
2,7,1,9,12
1,1,1,1,12
Output
Same
1
2
3
4
5
7
9
Different
2
1
12
1
1
1
1
12
Same / different
cell1 the same
cell2 the same
cell3 the same
cell4 the same
cell5 the same
cell6 not the same
cell7 the same
cell8 not the same
cell9 the same
cell10 not the same
cell11 not the same
cell12 not the same
cell13 not the same
cell14 not the same
cell15 not the same

Mapping lines to columns in *nix

I have a text file that was created when someone pasted from Excel into a text-only email message. There were originally five columns.
Column header 1
Column header 2
...
Column header 5
Row 1, column 1
Row 1, column 2
etc
Some of the data is single-word, some has spaces. What's the best way to get this data into column-formatted text with unix utils?
Edit: I'm looking for the following output:
Column header 1 Column header 2 ... Column header 5
Row 1 column 1 Row 1 column 2 ...
...
I was able to achieve this output by manually converting the data to CSV in vim (adding a comma to the end of each line, then joining each set of 5 lines with J), then running the result through column -ts, to get the desired output. But there has to be a better way for the next time this comes up.
Perhaps a perl one-liner isn't "the best" way, but it should work:
perl -ne 'BEGIN{$fields_per_line=5; $field_separator="\t";
                $line_break="\n"}
          chomp;
          print $_,
                $. % $fields_per_line ? $field_separator : $line_break;
          END{print $line_break}' INFILE > OUTFILE
Just substitute the 5, the "\t" (tab), and the "\n" (newline) as needed.
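For the record, paste alone can do the five-lines-to-one-row folding, since giving it standard input five times makes it consume five lines per output row. A sketch (assuming exactly five fields per record; column -ts$'\t' then aligns on the tabs that paste inserted, which keeps multi-word fields intact):
paste - - - - - < INFILE | column -ts$'\t' > OUTFILE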
You would have to use a script that reads the file line by line and keeps a counter. When it reaches the line you want, use the cut command with a space delimiter to get the word you want:
counter=0
lineNumber=3
while read -r line
do
counter=$((counter + 1))
if [ "$counter" -eq "$lineNumber" ]
then
echo "$line" | cut -d" " -f 4
fi
done < INFILE
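For a known line number, the loop can also be replaced by a one-liner (same assumptions: line 3, field 4):
awk 'NR==3 {print $4}' INFILE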
