awk split adds whole string to array position 1 (reason unknown) - linux

So I have a .txt file that looks like this:
mona 70 77 85 77
john 85 92 78 80
andreja 89 90 85 94
jasper 84 64 81 66
george 54 77 82 73
ellis 90 93 89 88
I have created a grades.awk script that contains the following code:
{
    FS=" "
    names=$1
    vi1=$2
    vi2=$3
    vi3=$4
    rv=$5
    #printf("%s ",names);
    split(names,nameArray," ");
    printf("%s\t",nameArray[1]); # prints the whole array of names for some reason, instead of just the name at position 1 in the array ("john")
}
So my question is, how do I split this correctly? Am I doing something wrong?
How do I read line by line, word by word, correctly? I need to add each column into its own array. I've been searching for the answer for quite some time now and can't fix my problem.

Here is a template to calculate average grades per student:
$ awk '{sum=0; for(i=2;i<=NF;i++) sum+=$i;
printf "%s\t%5.2f\n", $1, sum/(NF-1)}' file
mona 77.25
john 83.75
andreja 89.50
jasper 73.75
george 71.50
ellis 90.00
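
For reference, here is the same template spread over several lines with comments (identical logic, just easier to read):
awk '{
    sum = 0
    for (i = 2; i <= NF; i++) sum += $i    # add up all grade columns
    printf "%s\t%5.2f\n", $1, sum/(NF-1)   # name, then average over NF-1 grades
}' file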

printf("%s\t",nameArray[1])
is doing exactly what you want it to do, but you aren't printing any newline between invocations. It's getting called once per input line and outputting one word at a time, but since you aren't outputting any newlines between words, you just get one line of output. Change it to:
printf("%s\n",nameArray[1])
There are a few other issues with your code of course (e.g. you're setting FS in the wrong place and unnecessarily, names only ever contains one word so splitting it into an array doesn't make sense, etc.) but I think that's what you were asking about specifically.
If that's not all you want then edit your question to clarify what you're trying to do and add concise, testable sample input and expected output.
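
In case the bigger goal is to end up with one array per column, here is a minimal sketch of one way to do it (the array names follow the variables in the question; the END loop is just illustrative):
# grades.awk: store each column in its own array, indexed by line number
{
    names[NR] = $1
    vi1[NR]   = $2
    vi2[NR]   = $3
    vi3[NR]   = $4
    rv[NR]    = $5
}
END {
    for (i = 1; i <= NR; i++)
        printf "%s\t%s %s %s %s\n", names[i], vi1[i], vi2[i], vi3[i], rv[i]
}
Run it with awk -f grades.awk grades.txt. No split() or FS setting is needed, since awk already splits each line into $1..$NF on whitespace by default.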

Related

Replacing column value based on a line containing a specific string

I have a text file:
file_11199_name 45 69
file_11176_name 45 69
file_11156_name 45 69
where I want to change the value of column three to 1 when the first column has "11199" in the string.
The following three lines of AWK code seem to do what you need:
{c=$3}
$1~11199{c=1}
{ print $1,$2,c }
Line 1 assigns the value of the third column to a variable (c).
Line 2 assigns the value 1 to c if the first field contains 11199 ($1~11199).
Line 3 prints the output.
$ awk '{c=$3}$1~11199{c=1}{ print $1,$2,c }' file
file_11199_name 45 1
file_11176_name 45 69
file_11156_name 45 69
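As an aside (not part of the original answer), the same thing fits in one rule that rewrites the third field in place; the trailing 1 is the usual awk shorthand for "print the current record", and /11199/ is the slightly more idiomatic regex form of the bare number:
awk '$1~/11199/{$3=1}1' file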

Datamash: Transposing the column into rows based on group in bash

I have a tab delim file with a 2 columns like following
A 123
A 23
A 45
A 67
B 88
B 72
B 50
B 23
C 12
C 14
I want to transpose with the above data based on the first column like following
A 123 23 45 67
B 88 72 50 23
C 12 14
I tried datamash transpose < input-file.txt but it didn't yield the output I expected.
One awk version:
awk '{printf ($1!=f?"\n%s":" "$2),$0;f=$1}' file
A 123 23 45 67
B 88 72 50 23
C 12 14
With this version you get one leading blank line, but it should be fast and handle large data, since no loops or arrays are used.
$1!=f?"\n%s":" "$2 - if the first field is not equal to f, the format becomes "\n%s", which prints a newline followed by the whole record ($0);
if $1 equals f, the format is " "$2, which prints only a space and the second field.
f=$1 sets f to the first field for the next comparison.
datamash --group=1 --field-separator=' ' collapse 2 <file | tr ',' ' '
Output:
A 123 23 45 67
B 88 72 50 23
C 12 14
Input must be sorted, as in the question.
This might work for you (GNU sed):
sed -E ':a;N;s/^((\S+)\s+.*)\n\2/\1/;ta;P;D' file
Append the next line; if the first field of the appended line matches the first field of the first line, remove the newline and the appended line's first field, then repeat the append-and-merge. When the fields differ, print the first line of the pattern space, delete it and its newline, and start again with what remains.
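If the input were not sorted by the first column, a hedged awk alternative (a sketch, not from the original answers) is to collect values in an array keyed on column one, remembering first-seen key order since for (k in a) iterates in unspecified order:
awk '
!($1 in a) { order[++n] = $1; a[$1] = $2; next }  # first time we see this key
{ a[$1] = a[$1] " " $2 }                          # append later values for the key
END { for (i = 1; i <= n; i++) print order[i], a[order[i]] }
' file
Unlike the streaming version above, this buffers the whole file in memory.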

Where to place the return statement when defining a function to read in a file using with open(...) as ...?

I have a text file consisting of data that is separated by tab-delimited columns. There are many ways to read data from the file into Python, but I am specifically trying to use a method similar to the one outlined below. When using a context manager like with open(...) as ..., I've seen that the general convention is to have all of the subsequent code indented within the with statement. Yet when defining a function, the return statement is usually placed at the same indentation as the first line of code within the function (excluding cases with awkward if-else branches). In this case, both approaches work. Is one method considered correct or generally preferred over the other?
import numpy as np

def read_in(fpath, contents=[], row_limit=np.inf):
    """
    fpath is filelocation + filename + '.txt'
    contents is the initial data that the file data will be appended to
    row_limit is the maximum number of rows to be read (in case one would like to not read in every row).
    """
    nrows = 0
    with open(fpath, 'r') as f:
        for row in f:
            if nrows < row_limit:
                contents.append(row.split())
                nrows += 1
            else:
                break
        # return contents
    return contents
Below is a snippet of the text-file I am using for this example.
1996 02 08 05 17 49 263 70 184 247 126 0 -6.0 1.6e+14 2.7e+28 249
1996 02 12 05 47 26 91 53 160 100 211 236 2.0 1.3e+15 1.6e+29 92
1996 02 17 02 06 31 279 73 317 257 378 532 9.9 3.3e+14 1.6e+29 274
1996 02 17 05 18 59 86 36 171 64 279 819 27.9 NaN NaN 88
1996 02 19 05 15 48 98 30 266 129 403 946 36.7 NaN NaN 94
1996 03 02 04 11 53 88 36 108 95 120 177 1.0 1.5e+14 8.7e+27 86
1996 03 03 04 12 30 99 26 186 141 232 215 2.3 1.6e+14 2.8e+28 99
And below is a sample call.
fpath = "/Users/.../sample_data.txt"
data_in = read_in(fpath)
for i in range(len(data_in)):
    print(data_in[i])
(I realize that it's better to use chunks of pre-defined sizes to read in data, but the number of characters per row of data varies. So I'm instead trying to give user control over the number of rows read in; one could read in a subset of the rows at a time and append them into contents, continually passing them into read_in - possibly in a loop - if the file size is large enough. That said, I'd love to know if I'm wrong about this approach as well, though this isn't my main question.)
If your function needs to do some other things after reading the file, you usually do it outside the with block. So essentially you need to return outside the with block too.
However if the purpose of your function is just to read in a file, you can return within the with block, or outside it. I believe none of the methods are preferred in this case.
I don't really understand your second question.
You can also put the return inside the with context.
On exiting the context, the cleanup is done. This is the power of with: you don't need to check all possible exit paths. Note that the exit handler is also called when an exception is raised inside the with block.
But if the file is empty (as an example), you should still return something, so in such a case your code is clearer if it follows the principle of one exit path. If you have to handle reaching the end of the file without finding something important, I would put the normal return inside the with context and handle the special case after it.

Variable Delimited Text in Excel

I have a string of text that I need delimited:
New Utilizers 75 28 9 66 66 79 74 69 29 21 84 75 675 20,511 45,925
Ordinarily I would just use a space delimiter and I'd be all set, but this splits the "New Utilizers" string into two columns instead of one. Is there a way to start delimiting after a certain point, in this case after "New Utilizers"?
Can you choose another delimiter, say $ or ;?
If $, for example:
New Utilizers$75$28$9$66$66$79$74$69$29$21$84$75$675$20,511$45,925
then split by $.
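If the text can be pre-processed outside Excel first, one hedged option (a sketch, assuming the label is everything before the first numeric field) is to let awk insert the $ delimiters:
awk '{
    i = 1
    while (i <= NF && $i !~ /^[0-9.,]+$/) i++    # find the first numeric field
    label = $1
    for (j = 2; j < i; j++) label = label " " $j # keep the multi-word label whole
    out = label
    for (j = i; j <= NF; j++) out = out "$" $j   # join the numbers with $
    print out
}' file
On the sample line this prints New Utilizers$75$28$9$66$66$79$74$69$29$21$84$75$675$20,511$45,925, which Excel can then split on $.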

Ignore first few lines and last few lines in a file Linux

I have a file like this and would like to print $0 for every line except the first two and the last three lines, in Linux. I tried an awk command but had no luck. I am using the following command - I suppose I am doing something wrong, but I am not able to figure out what it is with my minimal experience in computer science.
awk '{if(NR>2){c++}else if(FNR<=c-3){print $0}}' samplefile.out > sampleout.txt
entry0 45
entry0 42
entry1 41
entry2 78
entry3 89
entry4 68
entryn 58
entryn 33
etnryn 52
Thanks
awk cannot look ahead so you'll have to save the lines.
awk 'NR>2{if(z!="")print z;z=y;y=x;x=$0}' file
Practically zero memory overhead
You can do it with a combination of head and tail:
tail -n +3 samplefile.out | head -n -3 > sampleout.txt
(tail -n +3 starts output at line 3, skipping the first two; head -n -3 drops the last three lines and requires GNU head.)
Try this:
awk 'NR>2{a[++j]=$0}END{for (i=1;i<=j-3;i++){print a[i]}}' samplefile.out
There's no way to know the number of lines in a file without reading it first, so you have to buffer lines as you go.
If the file is too big, the head + tail mix could be better, to avoid the memory overhead.
You may also try this, but it uses an array:
$ cat file
entry0 45
entry0 42
entry1 41
entry2 78
entry3 89
entry4 68
entryn 58
entryn 33
etnryn 52
$ awk 'NR>first+last{print A[NR%last]}{A[NR%last]=$0}' first=2 last=3 file
entry1 41
entry2 78
entry3 89
entry4 68
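Spelled out with comments, the rolling-buffer trick in that one-liner works like this (same logic, just reformatted; first and last are set via awk's pre-read variable assignments):
awk '
NR > first + last {       # after first+last lines have gone by...
    print A[NR % last]    # ...the slot about to be overwritten holds the
}                         # line from "last" lines back, so it is safe to print
{ A[NR % last] = $0 }     # keep a rolling buffer of the newest "last" lines
' first=2 last=3 file
The first two lines are buffered but overwritten before printing ever starts, and the final three lines are still sitting in the buffer when the input ends, so exactly the first first and last last lines are skipped.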
