Replace NAs with their respective column means in a very large text file - linux

I have a large text file: 400k rows and 10k columns, with all numeric data values being 0, 1, or 2, and a file size of 5-10 GB. The file contains some missing values (NA), and I want to replace each NA with the mean of its column, i.e. an NA in column 'x' must be replaced by the mean of column 'x'. These are the steps I want to perform:
Compute the mean of each column of my text file (excluding the header row and starting from the 7th column)
Replace each NA with its respective column mean
Write the modified data back out as a text file
Data subset:
IID FID PAT MAT SEX PHENOTYPE X1 X2 X3 X4......
1234 1234 0 0 1 -9 0 NA 0 1
2346 2346 0 0 2 -9 1 2 NA 1
1334 1334 0 0 2 -9 2 NA 0 2
4566 4566 0 0 2 -9 2 2 NA 0
4567 4567 0 0 1 -9 NA NA 1 1
# total 400k rows and 10k columns
Desired Output:
# Assuming only 5 rows as given in the above example.
# Mean of column X1 = (0 + 1+ 2+ 2)/4 = 1.25
# Mean of column X2 = (2 + 2)/2 = 2
# Mean of column X3 = (0 + 0 + 1)/3 = 0.33
# Mean of column X4 = No NAs, so no replacements
# Replacing NAs with respective means:
IID FID PAT MAT SEX PHENOTYPE X1 X2 X3 X4......
1234 1234 0 0 1 -9 0 2 0 1
2346 2346 0 0 2 -9 1 2 0.33 1
1334 1334 0 0 2 -9 2 2 0 2
4566 4566 0 0 2 -9 2 2 0.33 0
4567 4567 0 0 1 -9 1.25 2 1 1
I tried this:
file="path/to/data.txt"
# get the total number of columns
number_cols=$(awk '{print NF; exit}' "$file")
for ((i=7; i<=number_cols; i++))
do
    echo "$i"
    # mean of column i (note: this still counts the header row and treats NA as 0)
    mean+=($(awk -v i="$i" '{ total += $i } END { print total/NR }' "$file"))
done
# array of column means
echo "${mean[@]}"
# find and replace (newstr must be replaced by the respective column mean)
sed -i 's/NA/newstr/g' "$file"
However, this code is incomplete, and the for loop is very slow because it makes one pass over the huge file per column. Is there a faster way to do this? I also tried Python and R, but both were too slow. I am open to getting this done in any programming language as long as it is fast. Can someone please help me write the script?
Thanks
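A two-pass awk program avoids the per-column passes entirely: the first pass accumulates a sum and a non-NA count for every column from the 7th onward, and the second pass rewrites each row, substituting the column mean wherever it finds NA. This is a minimal sketch, assuming whitespace-separated fields, that NA is the only non-numeric token, and that means should be computed over non-NA values only; data_imputed.txt is a placeholder output name:
awk '
NR == FNR {                      # pass 1: accumulate sums and counts
    if (FNR == 1) next           # skip the header row
    for (i = 7; i <= NF; i++)
        if ($i != "NA") { sum[i] += $i; cnt[i]++ }
    next
}
FNR == 1 { print; next }         # pass 2: emit the header unchanged
{
    for (i = 7; i <= NF; i++)
        if ($i == "NA" && cnt[i] > 0)
            $i = sum[i] / cnt[i] # substitute the column mean
    print
}' "$file" "$file" > data_imputed.txt
Passing the file twice means only two arrays of about 10k entries each are held in memory, never the 400k x 10k table itself. One caveat: assigning to $i makes awk rebuild the record with single-space separators, which should be harmless for space-delimited data like this.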

Related

Counting multiple columns and listing the counts in separate columns while retaining a column

I have the following DataFrame:
id coord_id val1 val2 record val3
0 snp chr15_1-1000 1.0 0.9 xx12 2
1 snv chr15_1-1000 1.0 0.7 yy12 -4
2 ins chr15_1-1000 0.01 0.7 jj12 -4
3 ins chr15_1-1000 1.0 1.5 zzy1 -5
4 ins chr15_1-1000 1.0 1.5 zzy1 -5
5 del chr10_2000-4000 0.1 1.2 j112 12
6 del chr10_2000-4000 0.4 1.1 jh12 15
I am trying to count the number of times each coord_id appears for each id, while keeping the val1 column in the resulting table as a min-max range of the values in that column. For instance, I am trying to accomplish the following result:
id snp snv ins del total val1
chr15_1-1000 1 1 3 0 5 0.01-1.0
chr10_2000-4000 0 0 0 2 2 0.1-0.4
I want to sort it in ascending order by the column total.
I would much appreciate any help in advance.
First pivot into id columns with count aggregation and margin sums. Then join() with the val1 min-max strings:
(df.pivot_table(index='coord_id', columns='id', values='val1',
                aggfunc='count', fill_value=0,
                margins=True, margins_name='total')
   .join(df.groupby('coord_id').val1.agg(lambda x: f'{x.min()}-{x.max()}'))
   .sort_values('total', ascending=False)
   .drop('total'))
# del ins snp snv total val1
# coord_id
# chr15_1-1000 0 3 1 1 5 0.01-1.0
# chr10_2000-4000 2 0 0 0 2 0.1-0.4
I suggest making two computations separately -- get the range and count the frequency.
temp = test_df.groupby(['coord_id']).agg({'val1': ['min', 'max']})
temp.columns = temp.columns.get_level_values(1)
temp['val1'] = temp['min'].astype(str) + '-' + temp['max'].astype(str)
Then,
temp2 = test_df.groupby(['coord_id', 'id']).size().unstack('id', fill_value=0)
And, finally, merge them:
answer = pd.concat([temp, temp2], axis=1)

Summing over y values for same x value

I want gnuplot to plot the sum of all z values in all cases where the x and y values are equal.
A dummy data file looks like this:
#testfile
0 0 1
0 1 1
0 1 1
0 1 1
1 0 1
1 1 1
1 1 2
1 1 2
I am using plot "testfile" u 1:2:3 w p ps variable to scale the points according to the value in the third column, and I would like to find a command that gives the same plot for the above data file as if I were to plot this data file:
#testfile2
0 0 1
0 1 3
1 0 1
1 1 5
If that makes it easier, in my real data file, I always have to sum over two lines.
I don't know if you're looking for a gnuplot-only solution, but what you want can be accomplished with a simple awk one-liner, either run separately or embedded in gnuplot. By the way, this assumes that you always have to sum over two lines:
Input file:
0 1 1
0 1 1
1 0 1
1 0 2
1 1 2
1 1 2
By running:
awk '{sum+=$3} (NR%2)==0{print $1,$2,sum; sum=0;}' testfile
You would get:
0 1 2
1 0 3
1 1 4
Then you could save in a separate file and plot with the line you mentioned above. Alternatively, you can embed the awk line within gnuplot using:
plot "<awk '{sum+=$3} (NR%2)==0{print $1,$2,sum; sum=0;}' testfile" u 1:2:3 not w p ps variable pt 7
Hope it helps!
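If the two-line assumption ever fails to hold, a variant that groups by the (x, y) pair directly would cover the general case stated in the question. This is just a sketch; since awk's for-in iteration order is unspecified, the result is piped through sort:
awk '{sum[$1 FS $2] += $3} END {for (k in sum) print k, sum[k]}' testfile | sort -k1,1n -k2,2n
This prints one line per distinct (x, y) pair with the summed z value, regardless of how the input lines are ordered or grouped.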

Loop every three consecutive rows in linux

I have a file hundred.txt containing 100 rows.
For example:
1 0 0 1
1 1 0 1
1 0 1 0
1 0 1 0
0 1 1 0
....
1 0 0 1
I need to perform some calculations on every 3 consecutive rows; for instance, I first need to use Row1-Row3 for my calculation:
1 0 0 1
1 1 0 1
1 0 1 0
then the Row2-Row4:
1 1 0 1
1 0 1 0
1 0 1 0
...... and so on up to Row98-Row100.
Each window should generate its own output file (e.g. Row1.txt, Row2.txt, ..., Row98.txt). How can I solve this problem? Thank you.
bash isn't a great choice for data processing tasks, but it is possible (albeit slow):
{ read -r row1
  read -r row2
  count=0
  while read -r row3; do
    # Write the current three-row window, one row per line
    printf '%s\n' "$row1" "$row2" "$row3" > "Row$((count+=1)).txt"
    # Slide the window
    row1=$row2
    row2=$row3
  done
} < hundred.txt
awk to the rescue!
$ awk 'NR>2{printf "%s", a2 ORS a1 ORS $0 ORS > FILENAME"."(++c)}
{a2=a1;a1=$0}' file
for the input file
$ cat file
1 0 0 1
1 1 0 1
1 0 1 0
1 0 1 0
0 1 1 0
generates these 3 files:
$ head file.{1..3}
==> file.1 <==
1 0 0 1
1 1 0 1
1 0 1 0
==> file.2 <==
1 1 0 1
1 0 1 0
1 0 1 0
==> file.3 <==
1 0 1 0
1 0 1 0
0 1 1 0
You can embed your computation in the script and output only the results, but you didn't provide any details on that.
Explanation
NR>2 starting from the third row
printf ... print the last 3 rows
> FILENAME"."(++c) to a file derived from the input filename with a counter suffix
a2=a1;a1=$0 keep track of the last two rows
If your rolling window size n is small, you can scale this script by changing NR>2 to NR>(n-1) and keeping track of the last rows in a(n-1)...a1, printing accordingly. If n is large, it is better to use an array (or, better, a circular array).
This is perhaps the most generic version...
$ awk -v n=3 'NR>n-1{fn=FILENAME"."(c-n+2);   # number the output files from 1
      for(i=c+1;i<c+n;i++) printf "%s\n", a[(i-n)%n] > fn;
      print > fn}
      {a[(c++)%n]=$0}' file
One hundred rows of four binary-valued columns is not too much; just read it all in at once.
mapfile -t rows < inputfile
for r in "${!rows[@]}"; do # loop by row index
    (( r >= 2 )) || continue
    # process "${rows[r-2]}" "${rows[r-1]}" and "${rows[r]}",
    # e.g. write the window into file Row$((r-1)).txt:
    printf '%s\n' "${rows[r-2]}" "${rows[r-1]}" "${rows[r]}" > "Row$((r-1)).txt"
done
If the quantity of data grows significantly, you really want to use a better tool, such as Python+numpy (because your data looks like binary matrices).

Gnuplot draw logical gate output in time

I am working on a school project, which is a simulation of logical gates. I can implement and run the simulation with ease, but I need help with showing the output.
Right now, I print everything to the console, like this:
sample frequency: 50
###############################################
IN NOT(1) OUT
IN1:0 IN1:3 IN1:5
IN2:0 IN2:0 IN2:0
OUT:3 OUT:5 OUT:0
0 1 -1 -1
50 1 -1 -1
100 1 0 0
150 0 0 0
200 1 1 1
250 1 0 0
300 1 0 0
350 1 0 0 (IN = 1, delay is 1 so we can see
400 0 0 0 the correct output of NOT element in line 400 <-> 350 + 1*50)
450 1 1 1
500 1 0 0
550 1 0 0
600 1 0 0
650 0 0 0
700 0 1 1
750 1 1 1
800 1 0 0
850 1 0 0
900 1 0 0
950 1 0 0
1000 1 0 0
On the left is the simulation time (step). In each step, the values are printed out and a new set of inputs is generated.
Where there is -1, the output is undefined.
The 3rd row ( IN NOT(1) OUT ) means that there are 3 elements: 1 input, 1 NOT gate and an output. The value in brackets is the delay of the element, so an element with a delay of X will show the correct output after X*sample_freq (excluding time 0).
The rows below that mean:
IN1 - the index of the node that is read as input 1
IN2 - the index of the node that is read as input 2
OUT - the index of the output node
In this situation, the IN is giving its output to node #3. The NOT element reads its input from node #3 and gives some output to node #5. The overall output of this system is the OUT element, which reads from #5.
Here is the file that specifies the topology:
3 (number of elems)
IN 0 0 3 (no inputs for input element obviously)
NOT 3 0 5 (reads from #3 and outputs to #5)
OUT 5 0 0 (reads from #5 and this is the end point of the system)
There can obviously be more elements, INs and OUTs, but let's stick to this for the sake of simplicity.
And what I want to see as the result is: the X-axis shows the simulation time (0 - 1000, step 50), the Y-axis shows the output value of each element in the system, and the elements' outputs are drawn one above the other; see this picture as an example.
Can you tell me how to create this kind of gnuplot script that transforms the output of my application into the desired plot?
Thank you!
OK, I have found a solution myself; here it is:
First, I had to transform the output of the app a bit, so that it looks like this:
0 1 2 4
49 1 2 4
50 1 2 4
99 1 2 4
100 0 2 4
149 0 2 4
150 0 3 5
199 0 3 5
200 1 3 5
249 1 3 5
250 1 2 4
299 1 2 4
300 0 2 5
349 0 2 5
350 1 3 5
399 1 3 5
400 0 2 4
449 0 2 4
450 1 3 5
499 1 3 5
The extra sim time steps make the edges look almost square. I also separated the traces vertically by 2 (added 0 to column #2, 2 to column #3, 4 to column #4, and so on), so that they are drawn one above the other; a sketch of this transformation follows at the end of this answer. The simple command to plot this is:
plot 'out.txt' using 1:2 with lines, 'out.txt' using 1:3 with lines, 'out.txt' using 1:4 with lines
plus some set xtics, set ytics and other cosmetic stuff
Now I just have to label the lines with the names of the elements, and voila.
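For reference, the reshaping described above can be done with a short awk filter. This is only a sketch, assuming the raw simulator log has the sample time in column 1 and one trace value per element in columns 2..NF, with a 50-unit sample step; raw.txt and out.txt are hypothetical file names:
awk -v step=50 '{
    for (i = 2; i <= NF; i++)    # offset trace i by 2*(i-2) so the traces stack
        $i += 2 * (i - 2)
    t = $1
    print                        # sample at the start of the interval
    $1 = t + step - 1
    print                        # repeat just before the next sample for square edges
}' raw.txt > out.txt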

Create new data frame columns with binary (0/1) data based on text strings in existing column in R

I have an R data frame that looks like:
ID YR SC
ABX:22798 1976 A's Frnd; Cat; Cat & Mse
ABX:23798 1983 A's Frnd; Cat; Zebra Fish
ABX:22498 2010 Zebra Fish
ABX:22728 2010 Bear; Dog; Zebra Fish
ABX:22228 2011 Bear
example data:
df <- structure(list(ID = c("ABX:22798", "ABX:23798", "ABX:22498", "ABX:22728", "ABX:22228"), YR = c(1976, 1983, 2010, 2010, 2011), SC = c("A's Frnd; Cat; Cat & Mse", "A's Frnd; Cat; Zebra Fish", "Zebra Fish", "Bear; Dog; Zebra Fish", "Bear")), .Names = c("ID", "YR", "SC"), row.names = c(NA, 5L), class = "data.frame")
I would like to transform it by splitting the text string in the SC column on "; ". Then, I'd like to use the resulting lists of strings to populate new columns with binary data. The final data frame would look like this:
ID YR A's Frnd Bear Cat Cat & Mse Dog Zebra Fish
ABX:22798 1976 1 0 1 1 0 0
ABX:23798 1983 1 0 1 0 0 1
ABX:22498 2010 0 0 0 0 0 1
ABX:22728 2010 0 1 0 0 1 1
ABX:22228 2011 0 1 0 0 0 0
I'll be analyzing a number of different datasets individually. In any given data set, there are between about 100 and 230 unique SCs entries, and the number of rows per set ranges from about 500 to several thousand. The number of SCs per row ranges from 1 to about 6 or so.
I have had a couple of starts with this, most are quite ugly. I thought the approach below looked promising (it's similar to a python pandas implementation that works well). It would be great to learn a good way to do this in R!
My starter code:
# Get list of unique SCs
SCs <- df[,2]
SCslist <- lapply(SCs, strsplit, split="; ")
SCunique <- unique(unlist(SCslist, use.names = FALSE))
# Sort alphabetically,
# note that apostrophes could be a problem
SCunique <- sort(SCunique)
# create a dataframe of 0s to add to the original df
df0 <- as.data.frame(matrix(0, ncol=length(SCunique), nrow=nrow(df)))
colnames(df0) <- SCunique
...(and then...?)
I've found similar questions/answers, including:
Dummy variables from a string variable
Split strings into columns in R where each string has a potentially different number of column entries
Edit: Found one more answer set of interest:
Improve text processing speed using R and data.table
Thanks in advance for your answers.
I think this should do what you're looking for:
library(reshape2)
a <- strsplit(as.character(df$SC), "; ", fixed = TRUE)
dfL <- merge(df, melt(a), by.x = "row.names", by.y = "L1", all = TRUE)
dcast(dfL, ID + YR ~ value, value.var="value", fun.aggregate=length, fill = 0)
# ID YR A's Frnd Cat Cat & Mse Zebra Fish Bear Dog
# 1 ABX:22228 2011 0 0 0 0 1 0
# 2 ABX:22498 2010 0 0 0 1 0 0
# 3 ABX:22728 2010 0 0 0 1 1 1
# 4 ABX:22798 1976 1 1 1 0 0 0
# 5 ABX:23798 1983 1 1 0 1 0 0
Factor the "value" column first if the column order is important to you.
I also have a package called "splitstackshape" that would work nicely with this if your data didn't have quotes in it. That's a bug that I'm looking into. Once it is resolved, it should be possible to do a command like:
library(reshape2)
library(splitstackshape)
dcast(concat.split.multiple(df, "SC", ";", "long"),
      ID + YR ~ SC, value.var="SC", fun.aggregate=length, fill=0)
to get what you need.
Continuing on your logic, you can do something like:
##your code
# Get list of unique SCs
SCs <- df[,3] #NOTE: here, you have df[,2], but I guess should have been df[,3]
SCslist <- lapply(SCs, strsplit, split="; ")
SCunique <- unique(unlist(SCslist, use.names = FALSE))
# Sort alphabetically,
# note that apostrophes could be a problem
SCunique <- sort(SCunique)
##
df0 <- sapply(SCunique, function(x) as.integer(grepl(x, SCs)))
df0
# A's Frnd Bear Cat Cat & Mse Dog Zebra Fish
#[1,] 1 0 1 1 0 0
#[2,] 1 0 1 0 0 1
#[3,] 0 0 0 0 0 1
#[4,] 0 1 0 0 1 1
#[5,] 0 1 0 0 0 0
