I have been provided a data file in a format I have never seen. The data do not appear to be in columns, but rather in one long row. I can open the file in Notepad and see the data. So, the data do not appear to be encrypted.
When I open the data file in Notepad the row of data wraps back to the to left side of the Notepad window when I guess the data reach the maximum number of characters that Notepad allowed in a single row, and then the data continue in a new row.
There might be 10,000 rows of data when I open the file in Notepad. The data in one of these rows are not aligned with the data in the row above it or below it.
Here are some example data:
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1304 3 0 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 0205 0 3 0
40001 1 5 GGGG 2998 HURG SU111111 95 1.0 F1 4 0805 0 2 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1205 0 2 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1505 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2003 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2303 2 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2703 3 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999
Notice that when I paste the example data here, representing one row in Notepad, the columns are 'magically' aligned.
I have found that I can open the data file in Excel and the data are also aligned. I do need to manually assign column boundaries in Excel however. And Excel does not allow me to assign a column boundary beyond more-or-less Character Space 123.
Below is SAS code to read the data file, although this SAS code does not work correctly. Rather I guess this SAS code skips some of the data rows. Notice that the variable TT covers character spaces 125-207, but that there are only 120 characters in most rows. There are more than 120 characters in some rows. This difference in the number of characters among rows I suspect is the reason SAS cannot read this data file correctly.
option linesize = 210 ;
option pagesize = 30 ;
FILENAME myinput 'C:/Users/markm/simple SAS programs/mydata.new' ;
DATA mydata ;
INFILE myinput ;
INPUT
AA 2-9
BB 12-17
CC 18-22
DD $ 24-27
EE 30-33
FF $ 35-38
GG $ 40-47
HH 53-56
II 59-64
JJ $ 66-68
KK $ 70-71
LL 72-78
MM 79-85
NN $ 87-90
OO 91-95
PP 97-104
QQ 105-110
RR 112-120
SS $ 122-123
TT $ 125-207 ;
If I move the cursor to the right one character at a time over the first row of data using the right-arrow key I have to press the right-arrow key twice to move beyond character space 120 in Notepad.
All of this is telling me there are hidden characters in the data file used to identify the end of a line of data.
I opened the data file in Vim hoping to see these hidden characters, but did not see anything. Vim did align the columns correctly when I opened the file. So, Vim must be seeing these hidden end-of-line characters.
How can I see these end-of-line characters myself? I suspect there is an option in Vim to reveal the hidden characters.
How can I determine the application that created this data file?
How can I modify the above SAS code to read this data file correctly?
First off, double check your LRECL. You're missing basically half of your data, which makes me think you're reading in two lines for each line. You show 207 as your maximum line size, which should be under the default 256 LRECL, but seeing a number about 1/2 of the correct number makes me think you've made a mistake there.
Next, figure out if you are seeing basically every other line, or are you seeing the first 44k lines and then a sudden stop. If the latter, you have a DOS EOF character (1A) in the data, and you need to set the IGNOREDOSEOF option. If the former, then you have either an obvious LRECL problem as above, or you might have a nonobvious LRECL problem caused by unicode characters taking up multiple bytes (try LRECL=32767 and see if that fixes it; also would cause your data to look funny at some point in each line), or you have a weird line terminator problem (though an inconsistent one).
Then, assuming there is a problem with EOL characters (or EOF?), the way you approach this is to see exactly what is in your datafile.
Read in a dummy character, and then put the _infile_ line with hex. format. For example:
data test;
infile "d:\temp\utf8.txt" lrecl=256 RECFM=f;
input #1 x $1. #;
r = repeat('1234567890',8); *make this appropriate for your LS option in your log;
put r;
put _infile_;
put _infile_ hex512.;
stop; *we want to see just one line here;
run;
In that case i'm reading in 20 long lines, and using hex40., as it needs to be exactly double the line length. You can leave the length off (hex.) but you'll get some really long lines with tons of blanks if you do that. In your case, lrecl=207, you should use hex414. in theory (But might want to make your lrecl 256 and hex512. just in case). Since we're using RECFM=F, the idea is to have a LRECL longer than your real line length, so you can see a whole line in one run of this. (If one line doesn't tell you enough about this, use firstobs= to navigate to a later line, recognizing that if your LRECL is not exactly right for the data, you won't be skipping to the start of a true line, but skipping 256 byte chunks).
That will give you two strings, one the 'visible' string, which may be helpful for seeing what SAS thinks is at what spot, one the hex codes behind the visible string. The hex codes are 2 values per character (as one byte = 2 hex values), assuming you're in an ASCII environment (not a DBCS or Unicode environment). See this page for a list of ASCII codes.
Hex codes to look for:
1A = DOS EOF character.
0A = LF
0D = CR
If this is a Windows/Dos document, you should see CRLF consecutively at ends of lines, ie, 0D0A in a row, somewhere around 207. If this is a Unix document, you will see just 0A there. If this is a Mac OS document, you may see LFCR, or 0A0D. Why would anyone want to be consistent.
You probably will see something, since you're getting some number of lines. (If there was no line terminator, SAS would just give up after the first line.) You are more likely to have one of the following problems:
This is a DBCS file, so all characters really take up more than one byte. If you see a lot of 00 or 40 or 20 between characters (like, every single character has one), you have a DBCS (double byte character set) file - this is what, say, a Chinese or Japanese copy of Windows OS would likely produce. They use two bytes for every character in order to represent the full set of characters in their languages; but even when storing english documnets, they still use the full set - just adding a filler byte basically to still have reasonable ASCII appearance for noncompatible programs (or programs not set up properly, like SAS would be in this case).
This is a UTF-8 file, where characters may take multiple bytes (but may not). In this case you probably see some 'junk' in the data when viewing it this way, and every so often you get a character that takes up two or three spaces - often entirely full of 'junk' characters. UTF-8 can take between 1 and 4 bytes per character, usually powers of 2 (so 1,2,4) but will look 'normal' for ASCII characters (ie, it takes ASCII and adds a lot, making relatively few changes in the 00-7F range).
My gut is that you have a DBCS file, given you're skipping every other line roughly (though not exactly - and you are skipping MORE than that - which makes this a bit odd to me).
Here is how to see the hidden end-of-line characters in gVim 7.4:
Open gVim 7.4
Open the data file in gVim 7.4
Press the escape key a few times to access the line editor. Note pressing the escape key
will result in no visible result on the gVim 7.4 window.
Type :set list at the bottom of the gVim 7.4 window
Press the enter key
Once I did the above I saw a blue $ at the end of every line, which I assume is an end-of-line hidden character.
Maybe if I am able to remove these blue $ symbols and save the result under a new name SAS might be able to read that new data file. If I figure this out I will post an update.
EDIT
I tried to modify the instructions posted here by John Black to remove the $, but so far have had no luck: Read csv file with hidden or invisible character ^M
I typed :%s/$//g which replaced the blue $ with yellow $. Then I saved the file under a new name and opened the new file with gVim. But when I typed :set list the blue $ were still present in the new file.
Related
How can I put a selection of yanked lines at the end of other lines?
BEFORE:
11
1
2
3
10
0.0
0.045
0.09
0
AFTER:
11
1 0.0
2 0.045
3 0.09
10 0
To make a block selection until the line-end, even for an unmatched number of columns, first select vertically the first column in all lines and then press $.
So in your case, move the cursor to the first column of 0.0, press ctrl-v, move the cursor down to the last line, press $ and you have the block selection. Now you can cut it with d and paste it in the second line.
Edit: Since you put this 10 there, the pasting part gets a bit more complicated. I would first past the block in the right-most column that doesn't interfere with the other columns. So go to the second line, append two blanks after the 1 and then paste the block. The result will look like this:
11
1 0.0
2 0.045
3 0.09
10 0
Now you need to remove the duplicated blanks in all the lines (if they disturb you). In a simple example like this, you can do it just with another block selection. In more complicated examples you might do it with a search/replace pattern.
For this specific case you could also use :g and :m to reorder the lines and join them. With the cursor at the first line, do a }dd to delete the empty line. Cursor should be at the line with 0.0. Now do a simple:
:,$g/./2m-1|j
This works with :g acting as a sort of iterator over the lines. The :g command selects lines within a range matching a certain pattern and runs a command on such matched lines in the form :[range]g{pattern}{command}. Breaking it into smaller parts we have:
,$ this range is a short form of .,$, meaning from current line to the end of the file. That's why the cursor position is important.
/./ match lines containing any character, effectively a placeholder to say match every line.
2m-1|j the "tricky" part, composed of two commands:
2m-1 the move command, in the form [range]m{address}. Moves lines from range to below address. As utilized it means move the second line to address -1 which itself is another piece of "trickery". -1 is a short form of .-1 which means current line position (.) minus one. At this point, the cursor is moved upwards to sit on the line just moved.
j join command, joins the current line to the next one with a space in between.
You can simulate what happens step by step by running :2m-1 and then :j yourself on each line starting from the line with 0.0. Again, the :g here works just as an iterator for these commands over those last 4 lines.
Starting with (* represents cursor position):
11
1
2
3
10
* 0.0
0.045
0.09
0
Running :2 move .-1 transforms it into:
11
2
3
10
* 1
0.0
0.045
0.09
0
Then :join
11
2
3
10
* 1 0.0
0.045
0.09
0
Move to the next line and repeat.
#GeorgP.'s answer already outlines the necessity for doing a blockwise delete, so that the paste does not create new lines, but appends to the existing ones. It does not mention that the blockwise delete leaves behind empty lines, so there's an additional step to get rid of them.
I needed to "cast" register contents into a certain (characterwise / linewise / blockwise) mode so often, I wrote the UnconditionalPaste plugin for it. With it, you can delete the second block completely (i.e. linewise) with 4dd, and then use the plugin's gbp (go blockwise paste) mapping at the end of the second line. Like #GeorgP.'s answer, this will make the last line contain 100 instead of 10 0, and the same workaround of inserting an additional padding space is necessary.
However, my plugin offers many other variants of pastes, and the gBp pastes as a minimal fitting (not rectangular) block with a jagged right edge. When pasting with gBp at the end of the line, appends at the jagged end of following lines. This is exactly what is needed here. So, if you're willing to install a plugin, the whole operation can be done via 4dd2G$gBp.
I’d like to merge two blocks of lines in Vim, i.e., take lines k through l and append them to lines m through n. If you prefer a pseudocode explanation: [line[k+i] + line[m+i] for i in range(min(l-k, n-m)+1)].
For example,
abc
def
...
123
45
...
should become
abc123
def45
Is there a nice way to do this without copying and pasting manually line by line?
You can certainly do all this with a single copy/paste (using block-mode selection), but I'm guessing that's not what you want.
If you want to do this with just Ex commands
:5,8del | let l=split(#") | 1,4s/$/\=remove(l,0)/
will transform
work it
make it
do it
makes us
harder
better
faster
stronger
~
into
work it harder
make it better
do it faster
makes us stronger
~
UPDATE: An answer with this many upvotes deserves a more thorough explanation.
In Vim, you can use the pipe character (|) to chain multiple Ex commands, so the above is equivalent to
:5,8del
:let l=split(#")
:1,4s/$/\=remove(l,0)/
Many Ex commands accept a range of lines as a prefix argument - in the above case the 5,8 before the del and the 1,4 before the s/// specify which lines the commands operate on.
del deletes the given lines. It can take a register argument, but when one is not given, it dumps the lines to the unnamed register, #", just like deleting in normal mode does. let l=split(#") then splits the deleted lines into a list, using the default delimiter: whitespace. To work properly on input that had whitespace in the deleted lines, like:
more than
hour
our
never
ever
after
work is
over
~
we'd need to specify a different delimiter, to prevent "work is" from being split into two list elements: let l=split(#","\n").
Finally, in the substitution s/$/\=remove(l,0)/, we replace the end of each line ($) with the value of the expression remove(l,0). remove(l,0) alters the list l, deleting and returning its first element. This lets us replace the deleted lines in the order in which we read them. We could instead replace the deleted lines in reverse order by using remove(l,-1).
An elegant and concise Ex command solving the issue can be obtained by
combining the :global, :move, and :join commands. Assuming that
the first block of lines starts on the first line of the buffer, and
that the cursor is located on the line immediately preceding the first
line of the second block, the command is as follows.
:1,g/^/''+m.|-j!
For detailed explanation of this technique, see my answer to
an essentially the same question “How to achieve the “paste -d '␣'”
behavior out of the box in Vim?”.
To join blocks of line, you have to do the following steps:
Go to the third line: jj
Enter visual block mode: CTRL-v
Anchor the cursor to the end of the line (important for lines of differing length): $
Go to the end: CTRL-END
Cut the block: x
Go to the end of the first line: kk$
Paste the block here: p
The movement is not the best one (I'm not an expert), but it works like you wanted. Hope there will be a shorter version of it.
Here are the prerequisits so this technique works well:
All lines of the starting block (in the example in the question abc and def) have the same length XOR
the first line of the starting block is the longest, and you don't care about the additional spaces in between) XOR
The first line of the starting block is not the longest, and you additional spaces to the end.
Here's how I'd do it (with the cursor on the first line):
qama:5<CR>y$'a$p:5<CR>dd'ajq3#a
You need to know two things:
The line number on which the first line of the second group starts (5 in my case), and
the number of lines in each group (3 in my example).
Here's what's going on:
qa records everything up to the next q into a "buffer" in a.
ma creates a mark on the current line.
:5<CR> goes to the next group.
y$ yanks the rest of the line.
'a returns to the mark, set earlier.
$p pastes at the end of the line.
:5<CR> returns to the second group's first line.
dd deletes it.
'a returns to the mark.
jq goes down one line, and stops recording.
3#a repeats the action for each line (3 in my case)
As mentioned elsewhere, block selection is the way to go. But you can also use any variant of:
:!tail -n -6 % | paste -d '\0' % - | head -n 5
This method relies on the UNIX command line. The paste utility was created to handle this sort of line merging.
PASTE(1) BSD General Commands Manual PASTE(1)
NAME
paste -- merge corresponding or subsequent lines of files
SYNOPSIS
paste [-s] [-d list] file ...
DESCRIPTION
The paste utility concatenates the corresponding lines of the given input files, replacing all but the last file's newline characters with a single tab character,
and writes the resulting lines to standard output. If end-of-file is reached on an input file while other input files still contain data, the file is treated as if
it were an endless source of empty lines.
Sample data is the same as rampion's.
:1,4s/$/\=getline(line('.')+4)/ | 5,8d
I wouldn't think make it too complicated.
I would just set virtualedit on
(:set virtualedit=all)
Select block 123 and all below.
Put it after the first column:
abc 123
def 45
... ...
and remove the multiple space between to 1 space:
:%s/\s\{2,}/ /g
I would use complex repeats :)
Given this:
aaa
bbb
ccc
AAA
BBB
CCC
With the cursor on the first line, press the following:
qa}jdd''pkJxjq
and then press #a (and you may subsequently use ##) as many times as needed.
You should end up with:
aaaAAA
bbbBBB
cccCCC
(Plus a newline.)
Explaination:
qa starts recording a complex repeat in a
} jumps to the next empty line
jdd deletes the next line
'' goes back to the position before the last jump
p paste the deleted line under the current one
kJ append the current line to the end of the previous one
x delete the space that J adds between the combined lines; you can omit this if you want the space
j go to the next line
q end the complex repeat recording
After that you'd use #a to run the complex repeat stored in a, and then you can use ## to rerun the last ran complex repeat.
There can be many number of ways to accomplish this. I will merge two blocks of text using any of the following two methods.
suppose first block is at line 1 and 2nd block starts from line 10 with the cursor's initial position at line number 1.
(\n means pressing the enter key.)
1. abc
def
ghi
10. 123
456
789
with a macro using the commands: copy,paste and join.
qaqqa:+9y\npkJjq2#a10G3dd
with a macro using the commands move a line at nth line number and join.
qcqqc:10m .\nkJjq2#c
How I can move or shift the words in the entire file to the specified column?
For example like below:
Before :
123 ABC
112 XYZS
15925 asdf
1111 25asd
1 qwer
After :
123 ABC
112 XYZS
15925 asdf
1111 25asd
1 qwer
How it can be done using command mode?
Here the thing is we need to shift the 2nd word to the specified column
Here the specified column is 8
except for vim-plugins mentioned by others, if you were working on a linux box with column command available, you could just :
%!column -t
% could be vim ranges, e.g. visual selections etc..
Approach with built-in commands
First :substitute the whitespace with a Tab character, and then :retab to a tab stop to column 8, expanding to spaces (for your given example):
:.,.+4substitute/\s\+/\t/ | set tabstop=7 expandtab | '[,']retab
(I'm omitting the resetting of the modified options, should that matter to you.)
Approach with plugin
My AlignFromCursor plugin has commands that align text to the right of the cursor to a certain column. Combine that with a :global command that invokes this for all lines in the range, and a W motion to go to the second word in each, and you'll get:
.,.+4global/^/exe 'normal! W' | LeftAlignFromCursor 8
I use the Tabular plugin. After installing it, you run the following command:
:%Tab/\s
where \s means whitespace character
I have made two functions for this problem.
I have posted it here : https://github.com/imbichie/vim-vimrc-/blob/master/MCCB_MCCE.vim
We need to call this function in vim editor and give the Number of Occurrence of the Character or Space that you wants to move and the character inside the '' and the column number.
The number of occurrence can be from the starting of each line (MCCB function) or can be at the end of each line (MCCE function).
for the above example mentioned in the question we can use the MCCB function and the character we can use space, so the usage will be like this in the vim editor.
:1,5call MCCB(1,' ',8)
So this will move the first space (' ') to the 8th column from line number 1 to 5.
9
37
92
93
96
98
118
128
135
136
139
I have about 13K plus records like the list above. And I want to append a ',' after every number?
What would be the best/easiest way to do this?
This sounds like a job for Notepad++!
Hit Ctrl+F and choose the Replace tab, then fill in these details:
Find what: \r\n
Replace with: ,\r\n
Search mode: Extended
Click Replace All, and Bob's your uncle!
Depending on your os and language:
Read a linw
Write all but the carriage return
Write ,
Write the carriage return
carry on at 1. unless you have finished.
On Unix/Linux boxes you could use sed or awk, just about any programming language, I wouldn't recommend asm or fortran but not impossible with them.
On any machine any language available that supports text processing, any utility that supports regular expressions. In excel you can create another column that concatenated the contents of the cell to the left and ',' then hide the original column
for data in column A1 you can use below
CONCAT(A1,CHAR(39), char(44),CHAR(39))
FYI,
CHAR(39) - '
CHAR(44) - ,
I’d like to merge two blocks of lines in Vim, i.e., take lines k through l and append them to lines m through n. If you prefer a pseudocode explanation: [line[k+i] + line[m+i] for i in range(min(l-k, n-m)+1)].
For example,
abc
def
...
123
45
...
should become
abc123
def45
Is there a nice way to do this without copying and pasting manually line by line?
You can certainly do all this with a single copy/paste (using block-mode selection), but I'm guessing that's not what you want.
If you want to do this with just Ex commands
:5,8del | let l=split(#") | 1,4s/$/\=remove(l,0)/
will transform
work it
make it
do it
makes us
harder
better
faster
stronger
~
into
work it harder
make it better
do it faster
makes us stronger
~
UPDATE: An answer with this many upvotes deserves a more thorough explanation.
In Vim, you can use the pipe character (|) to chain multiple Ex commands, so the above is equivalent to
:5,8del
:let l=split(#")
:1,4s/$/\=remove(l,0)/
Many Ex commands accept a range of lines as a prefix argument - in the above case the 5,8 before the del and the 1,4 before the s/// specify which lines the commands operate on.
del deletes the given lines. It can take a register argument, but when one is not given, it dumps the lines to the unnamed register, #", just like deleting in normal mode does. let l=split(#") then splits the deleted lines into a list, using the default delimiter: whitespace. To work properly on input that had whitespace in the deleted lines, like:
more than
hour
our
never
ever
after
work is
over
~
we'd need to specify a different delimiter, to prevent "work is" from being split into two list elements: let l=split(#","\n").
Finally, in the substitution s/$/\=remove(l,0)/, we replace the end of each line ($) with the value of the expression remove(l,0). remove(l,0) alters the list l, deleting and returning its first element. This lets us replace the deleted lines in the order in which we read them. We could instead replace the deleted lines in reverse order by using remove(l,-1).
An elegant and concise Ex command solving the issue can be obtained by
combining the :global, :move, and :join commands. Assuming that
the first block of lines starts on the first line of the buffer, and
that the cursor is located on the line immediately preceding the first
line of the second block, the command is as follows.
:1,g/^/''+m.|-j!
For detailed explanation of this technique, see my answer to
an essentially the same question “How to achieve the “paste -d '␣'”
behavior out of the box in Vim?”.
To join blocks of line, you have to do the following steps:
Go to the third line: jj
Enter visual block mode: CTRL-v
Anchor the cursor to the end of the line (important for lines of differing length): $
Go to the end: CTRL-END
Cut the block: x
Go to the end of the first line: kk$
Paste the block here: p
The movement is not the best one (I'm not an expert), but it works like you wanted. Hope there will be a shorter version of it.
Here are the prerequisits so this technique works well:
All lines of the starting block (in the example in the question abc and def) have the same length XOR
the first line of the starting block is the longest, and you don't care about the additional spaces in between) XOR
The first line of the starting block is not the longest, and you additional spaces to the end.
Here's how I'd do it (with the cursor on the first line):
qama:5<CR>y$'a$p:5<CR>dd'ajq3#a
You need to know two things:
The line number on which the first line of the second group starts (5 in my case), and
the number of lines in each group (3 in my example).
Here's what's going on:
qa records everything up to the next q into a "buffer" in a.
ma creates a mark on the current line.
:5<CR> goes to the next group.
y$ yanks the rest of the line.
'a returns to the mark, set earlier.
$p pastes at the end of the line.
:5<CR> returns to the second group's first line.
dd deletes it.
'a returns to the mark.
jq goes down one line, and stops recording.
3#a repeats the action for each line (3 in my case)
As mentioned elsewhere, block selection is the way to go. But you can also use any variant of:
:!tail -n -6 % | paste -d '\0' % - | head -n 5
This method relies on the UNIX command line. The paste utility was created to handle this sort of line merging.
PASTE(1) BSD General Commands Manual PASTE(1)
NAME
paste -- merge corresponding or subsequent lines of files
SYNOPSIS
paste [-s] [-d list] file ...
DESCRIPTION
The paste utility concatenates the corresponding lines of the given input files, replacing all but the last file's newline characters with a single tab character,
and writes the resulting lines to standard output. If end-of-file is reached on an input file while other input files still contain data, the file is treated as if
it were an endless source of empty lines.
Sample data is the same as rampion's.
:1,4s/$/\=getline(line('.')+4)/ | 5,8d
I wouldn't think make it too complicated.
I would just set virtualedit on
(:set virtualedit=all)
Select block 123 and all below.
Put it after the first column:
abc 123
def 45
... ...
and remove the multiple space between to 1 space:
:%s/\s\{2,}/ /g
I would use complex repeats :)
Given this:
aaa
bbb
ccc
AAA
BBB
CCC
With the cursor on the first line, press the following:
qa}jdd''pkJxjq
and then press #a (and you may subsequently use ##) as many times as needed.
You should end up with:
aaaAAA
bbbBBB
cccCCC
(Plus a newline.)
Explaination:
qa starts recording a complex repeat in a
} jumps to the next empty line
jdd deletes the next line
'' goes back to the position before the last jump
p paste the deleted line under the current one
kJ append the current line to the end of the previous one
x delete the space that J adds between the combined lines; you can omit this if you want the space
j go to the next line
q end the complex repeat recording
After that you'd use #a to run the complex repeat stored in a, and then you can use ## to rerun the last ran complex repeat.
There can be many number of ways to accomplish this. I will merge two blocks of text using any of the following two methods.
suppose first block is at line 1 and 2nd block starts from line 10 with the cursor's initial position at line number 1.
(\n means pressing the enter key.)
1. abc
def
ghi
10. 123
456
789
with a macro using the commands: copy,paste and join.
qaqqa:+9y\npkJjq2#a10G3dd
with a macro using the commands move a line at nth line number and join.
qcqqc:10m .\nkJjq2#c