Concatenate a special character to a column of data? - excel

9
37
92
93
96
98
118
128
135
136
139
I have 13K-plus records like the list above, and I want to append a ',' after every number.
What would be the best/easiest way to do this?

This sounds like a job for Notepad++!
Hit Ctrl+F and choose the Replace tab, then fill in these details:
Find what: \r\n
Replace with: ,\r\n
Search mode: Extended
Click Replace All, and Bob's your uncle!

Depending on your OS and language:
1. Read a line.
2. Write all but the carriage return.
3. Write ,
4. Write the carriage return.
5. Carry on at 1 unless you have finished.
On Unix/Linux boxes you could use sed or awk, or just about any programming language. I wouldn't recommend asm or Fortran, though it is not impossible with them.
On any machine, any available language that supports text processing, or any utility that supports regular expressions, will do. In Excel you can create another column that concatenates the contents of the cell to the left with ',' and then hide the original column.
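For what it's worth, the read/write loop described above could be sketched in Python roughly like this. The function name append_comma is only an illustrative choice, and the sketch assumes every record sits on its own line.

```python
def append_comma(lines):
    """Return each line with its line ending stripped and a ',' appended."""
    # rstrip("\r\n") drops both Windows (CRLF) and Unix (LF) line endings.
    return [line.rstrip("\r\n") + "," for line in lines]


if __name__ == "__main__":
    sample = ["9\n", "37\n", "92\r\n"]
    # Joining with "\n" puts the line breaks back after the added commas.
    print("\n".join(append_comma(sample)))
```

For a real file you would pass the open file object instead of the sample list, since iterating over a file yields its lines.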

For data in cell A1 you can use the formula below, which appends ',' (including the quote marks) to the value:
=CONCAT(A1,CHAR(39),CHAR(44),CHAR(39))
FYI,
CHAR(39) - '
CHAR(44) - ,

Related

How to conditionally replace numbers using vim

I can see a lot of discussion on how to replace strings conditionally using :%s. But I want to do exactly this in my file that has a huge bunch of numbers in a CSV format:
Find numbers < -100
Replace them by -98
How can I do this in VIM or any other editor/script language?
It is possible using the submatch() function like this:
:%s/-[0-9]\+/\=submatch(0) < -100 ? -98 : submatch(0)/g
Now every number smaller than -100 will be replaced by -98 and the rest just stays the same. Note that this regex will only match negative numbers.
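Since the question also allows a script language, here is a minimal Python sketch of the same conditional replacement. The name clamp_numbers and its parameters are made up for illustration, and the pattern assumes plain decimal numbers; the lookbehind is only there so that an exponent such as the -4 in 3e-4 is left alone.

```python
import re

# A minus sign not preceded by a word character or a dot, then digits
# with an optional fractional part.
NEGATIVE_NUMBER = re.compile(r"(?<![\w.])-\d+(?:\.\d+)?")


def clamp_numbers(text, threshold=-100, replacement="-98"):
    """Replace every negative number below `threshold` with `replacement`."""
    def fix(match):
        value = float(match.group(0))
        return replacement if value < threshold else match.group(0)

    return NEGATIVE_NUMBER.sub(fix, text)


if __name__ == "__main__":
    print(clamp_numbers("5,-150,-100,-99,-101.5"))
```

Like the Vim command, this only ever touches negative numbers, so positive values pass through unchanged.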

How can I split a phrase into a new line every x characters on Google Sheets?

I am translating a game, and the game's text box only supports 50 characters max per line. Is there a way to use a formula to split the entire sentence every 50 characters or whole word (49, 48, 47, etc)?
I am currently working with this formula.
=JOIN(CHAR(10),SPLIT(REGEXREPLACE(A1, "(.{50})", "/$1"),"/"))
The problem with this formula is that it splits at exactly 50 characters (and only once), so it will split in the middle of a word.
So again, my goal is for it not to split on the 50th character if that character is in the middle of a word, and for the rule to apply to all the remaining lines too, because at the moment it only applies to the first line.
Please take a look at this test google sheet to get an example of what I am talking about.
If it's impossible to do it on Google Sheets, I don't mind moving to Excel provided I get a functioning code.
For the record, I did ask in Google's product forums 2 days ago, and still haven't received an answer.
=REGEXREPLACE(A1, "(.{1,50})\b", "$1" & CHAR(10))
{50} matches exactly 50 times, but what you need is 50 or fewer.
\b is a word boundary, which matches between an alphanumeric and a non-alphanumeric character.
= REGEXEXTRACT(A1,"(?ism)^"&REPT("([\w\d'\(\),. ]{0,49}\s)", ROUNDUP(LEN(A1)/50,0))&"([\w\d'\(\),. ]{0,49})$")
Tested with various expressions and works as intended. Note that only these characters [a-zA-Z0-9_'(),.] are allowed, which means - and other characters not mentioned will not work. If you need them, add them to the character classes both inside the REPT expression and in the final part of the regex. Otherwise, this will work.
You are pretty close. I'm not an expert in Sheets, so not sure if this is the best way, but your Regex is wrong for what you want.
Also, you need to be certain that you don't use a split character that might appear in the phrase itself. However, using CHAR(10) for the replace character allows you to insert LF without going through the JOIN SPLIT sequence.
1. Replace any run of line feeds, carriage returns, and spaces with a single space.
2. Match strings that start with a non-space character followed by up to 49 more characters, which are followed by a space or the end of the string.
3. Replace each match with the capture group followed by CHAR(10) (deleting the space that followed).
4. There will be an extra CHAR(10) at the end, which you can strip off.
EDIT Regex changed slightly due to a difference in behavior between Google's RE and what I am used to (probably has to do with how a non-backtracking regex works). The problem showed up on your example:
=regexreplace(REGEXREPLACE(REGEXREPLACE(A1 & " ","[\r\n\s]+"," "),"(\S.{0,49})\s","$1" & char(10)),"\n+\z","")
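If moving outside the spreadsheet is an option, the same word-aware wrapping can be sketched in Python with the standard textwrap module. This is only an illustration of the idea, not a Sheets formula; the function name wrap_for_textbox is hypothetical, and the width of 50 comes from the question.

```python
import textwrap


def wrap_for_textbox(phrase, width=50):
    """Break `phrase` into lines of at most `width` characters.

    Lines are broken at whitespace, so words stay whole unless a single
    word is itself longer than `width`.
    """
    return "\n".join(textwrap.wrap(phrase, width=width))


if __name__ == "__main__":
    text = "I am translating a game, and the text box only supports 50 characters per line."
    print(wrap_for_textbox(text))
```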

data file in horizontal format containing hidden characters

I have been provided a data file in a format I have never seen. The data do not appear to be in columns, but rather in one long row. I can open the file in Notepad and see the data. So, the data do not appear to be encrypted.
When I open the data file in Notepad, the row of data wraps back to the left side of the Notepad window when, I guess, the data reach the maximum number of characters that Notepad allows in a single row, and then the data continue in a new row.
There might be 10,000 rows of data when I open the file in Notepad. The data in one of these rows are not aligned with the data in the row above it or below it.
Here are some example data:
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1304 3 0 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 0205 0 3 0
40001 1 5 GGGG 2998 HURG SU111111 95 1.0 F1 4 0805 0 2 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1205 0 2 0
40001 1 5 GGGG 2998 HHHH SU111111 95 1.0 F1 4 1505 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2003 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2303 2 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999 1.0 F3 4 2703 3 0 0
40002 2 8 GGGG 2998 PPPP SK777777 -999
Notice that when I paste the example data here, representing one row in Notepad, the columns are 'magically' aligned.
I have found that I can open the data file in Excel and the data are also aligned. I do need to manually assign column boundaries in Excel however. And Excel does not allow me to assign a column boundary beyond more-or-less Character Space 123.
Below is SAS code to read the data file, although this SAS code does not work correctly. Rather I guess this SAS code skips some of the data rows. Notice that the variable TT covers character spaces 125-207, but that there are only 120 characters in most rows. There are more than 120 characters in some rows. This difference in the number of characters among rows I suspect is the reason SAS cannot read this data file correctly.
option linesize = 210 ;
option pagesize = 30 ;
FILENAME myinput 'C:/Users/markm/simple SAS programs/mydata.new' ;
DATA mydata ;
INFILE myinput ;
INPUT
AA 2-9
BB 12-17
CC 18-22
DD $ 24-27
EE 30-33
FF $ 35-38
GG $ 40-47
HH 53-56
II 59-64
JJ $ 66-68
KK $ 70-71
LL 72-78
MM 79-85
NN $ 87-90
OO 91-95
PP 97-104
QQ 105-110
RR 112-120
SS $ 122-123
TT $ 125-207 ;
If I move the cursor to the right one character at a time over the first row of data using the right-arrow key I have to press the right-arrow key twice to move beyond character space 120 in Notepad.
All of this is telling me there are hidden characters in the data file used to identify the end of a line of data.
I opened the data file in Vim hoping to see these hidden characters, but did not see anything. Vim did align the columns correctly when I opened the file. So, Vim must be seeing these hidden end-of-line characters.
How can I see these end-of-line characters myself? I suspect there is an option in Vim to reveal the hidden characters.
How can I determine the application that created this data file?
How can I modify the above SAS code to read this data file correctly?
First off, double check your LRECL. You're missing basically half of your data, which makes me think you're reading in two lines for each line. You show 207 as your maximum line size, which should be under the default 256 LRECL, but seeing a number about 1/2 of the correct number makes me think you've made a mistake there.
Next, figure out if you are seeing basically every other line, or are you seeing the first 44k lines and then a sudden stop. If the latter, you have a DOS EOF character (1A) in the data, and you need to set the IGNOREDOSEOF option. If the former, then you have either an obvious LRECL problem as above, or you might have a nonobvious LRECL problem caused by unicode characters taking up multiple bytes (try LRECL=32767 and see if that fixes it; also would cause your data to look funny at some point in each line), or you have a weird line terminator problem (though an inconsistent one).
Then, assuming there is a problem with EOL characters (or EOF?), the way you approach this is to see exactly what is in your datafile.
Read in a dummy character, and then put the _infile_ line with hex. format. For example:
data test;
infile "d:\temp\utf8.txt" lrecl=256 RECFM=f;
input @1 x $1. @;
r = repeat('1234567890',8); *make this appropriate for your LS option in your log;
put r;
put _infile_;
put _infile_ hex512.;
stop; *we want to see just one line here;
run;
The hex format width needs to be exactly double the line length: for 20-character lines you would use hex40. You can leave the length off (hex.), but you'll get some really long lines with tons of blanks if you do that. In your case, with lrecl=207, you should use hex414. in theory (but you might want to make your lrecl 256 and use hex512. just in case). Since we're using RECFM=F, the idea is to have an LRECL longer than your real line length, so you can see a whole line in one run of this. (If one line doesn't tell you enough, use firstobs= to navigate to a later line, recognizing that if your LRECL is not exactly right for the data, you won't be skipping to the start of a true line, but skipping 256-byte chunks.)
That will give you two strings, one the 'visible' string, which may be helpful for seeing what SAS thinks is at what spot, one the hex codes behind the visible string. The hex codes are 2 values per character (as one byte = 2 hex values), assuming you're in an ASCII environment (not a DBCS or Unicode environment). See this page for a list of ASCII codes.
Hex codes to look for:
1A = DOS EOF character.
0A = LF
0D = CR
If this is a Windows/DOS document, you should see CRLF consecutively at the ends of lines, i.e., 0D0A in a row, somewhere around 207. If this is a Unix document, you will see just 0A there. If this is a classic Mac OS document, you will see just CR, or 0D. Why would anyone want to be consistent?
You probably will see something, since you're getting some number of lines. (If there was no line terminator, SAS would just give up after the first line.) You are more likely to have one of the following problems:
This is a DBCS file, so all characters really take up more than one byte. If you see a lot of 00 or 40 or 20 between characters (like, every single character has one), you have a DBCS (double byte character set) file - this is what, say, a Chinese or Japanese copy of Windows OS would likely produce. They use two bytes for every character in order to represent the full set of characters in their languages; but even when storing English documents, they still use the full set - just adding a filler byte basically to still have reasonable ASCII appearance for noncompatible programs (or programs not set up properly, like SAS would be in this case).
This is a UTF-8 file, where characters may take multiple bytes (but may not). In this case you probably see some 'junk' in the data when viewing it this way, and every so often you get a character that takes up two or three spaces - often entirely full of 'junk' characters. UTF-8 takes between 1 and 4 bytes per character, but will look 'normal' for ASCII characters (i.e., it takes ASCII and adds a lot, leaving the 00-7F range unchanged).
My gut is that you have a DBCS file, given you're skipping every other line roughly (though not exactly - and you are skipping MORE than that - which makes this a bit odd to me).
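If SAS is not handy, the same kind of byte-level check can be sketched in Python. This is only an illustration of looking for the hex codes listed above (0D0A, 0A, 0D, 1A); the function name count_line_endings is made up, and the file has to be opened in binary mode so that nothing gets translated on the way in.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone LF, lone CR, and DOS EOF (1A) bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        # Every CRLF pair contains one LF and one CR, so subtract the pairs.
        "LF": data.count(b"\n") - crlf,
        "CR": data.count(b"\r") - crlf,
        "EOF_1A": data.count(b"\x1a"),
    }


if __name__ == "__main__":
    sample = b"40001 1 5\r\n40002 2 8\r\n"
    print(count_line_endings(sample))
    # The first bytes as hex, two hex digits per byte.
    print(sample[:16].hex())
```

For the real file, something like `open("mydata.new", "rb").read()` would supply the bytes; the path is whatever the data file is actually called.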
Here is how to see the hidden end-of-line characters in gVim 7.4:
1. Open gVim 7.4.
2. Open the data file in gVim 7.4.
3. Press the escape key a few times to access the line editor. Note that pressing the escape key produces no visible change in the gVim 7.4 window.
4. Type :set list at the bottom of the gVim 7.4 window.
5. Press the enter key.
Once I did the above I saw a blue $ at the end of every line, which I assume is an end-of-line hidden character.
Maybe if I am able to remove these blue $ symbols and save the result under a new name SAS might be able to read that new data file. If I figure this out I will post an update.
EDIT
I tried to modify the instructions posted here by John Black to remove the $, but so far have had no luck: Read csv file with hidden or invisible character ^M
I typed :%s/$//g which replaced the blue $ with yellow $. Then I saved the file under a new name and opened the new file with gVim. But when I typed :set list the blue $ were still present in the new file.
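As far as I know, the $ that :set list shows is just Vim's marker for where each line ends, not a byte stored in the file, which would explain why it cannot be deleted. If the real obstacle is that records differ in length, a reader that tolerates short lines may be enough. Below is a minimal Python sketch of that idea, assuming the column positions from the INPUT statement in the question (only four are copied here); the names COLUMNS and parse_record are hypothetical.

```python
# (name, first column, last column): 1-based and inclusive, as in the
# SAS INPUT statement. Only a few of the columns are listed here.
COLUMNS = [("AA", 2, 9), ("BB", 12, 17), ("SS", 122, 123), ("TT", 125, 207)]


def parse_record(line, columns=COLUMNS):
    """Slice one fixed-width record into a dict of stripped strings.

    Slicing past the end of a short line simply yields an empty string,
    so records shorter than the widest column do not raise an error.
    """
    line = line.rstrip("\r\n")
    return {name: line[start - 1:end].strip() for name, start, end in columns}


if __name__ == "__main__":
    record = " " + "40001".rjust(8) + "  " + "1".rjust(6) + "\r\n"
    print(parse_record(record))
```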

Align the words to the specified column in vim using commands

How can I move or shift the words in the entire file to a specified column?
For example like below:
Before :
123 ABC
112 XYZS
15925 asdf
1111 25asd
1 qwer
After :
123    ABC
112    XYZS
15925  asdf
1111   25asd
1      qwer
How can this be done using command mode?
The point is that we need to shift the second word to the specified column.
Here the specified column is 8.
Apart from the Vim plugins mentioned by others, if you are working on a Linux box with the column command available, you could just run:
:%!column -t
The % can be replaced by any Vim range, e.g. a visual selection.
Approach with built-in commands
First :substitute the whitespace with a Tab character, and then :retab to a tab stop to column 8, expanding to spaces (for your given example):
:.,.+4substitute/\s\+/\t/ | set tabstop=7 expandtab | '[,']retab
(I'm omitting the resetting of the modified options, should that matter to you.)
Approach with plugin
My AlignFromCursor plugin has commands that align text to the right of the cursor to a certain column. Combine that with a :global command that invokes this for all lines in the range, and a W motion to go to the second word in each, and you'll get:
.,.+4global/^/exe 'normal! W' | LeftAlignFromCursor 8
I use the Tabular plugin. After installing it, you run the following command:
:%Tab/\s
where \s means whitespace character
I have made two functions for this problem.
I have posted it here : https://github.com/imbichie/vim-vimrc-/blob/master/MCCB_MCCE.vim
Call the function from within Vim, passing the occurrence number of the character or space that you want to move, the character itself inside '', and the target column number.
The occurrence can be counted from the start of each line (MCCB function) or from the end of each line (MCCE function).
For the example in the question we can use the MCCB function with a space as the character, so the usage in Vim looks like this:
:1,5call MCCB(1,' ',8)
So this will move the first space (' ') to the 8th column from line number 1 to 5.
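For comparison, the same alignment can be sketched outside Vim in a few lines of Python. This is just an illustration under the question's assumptions (two whitespace-separated parts per line, target column 8); the function name align_second_word is made up.

```python
def align_second_word(lines, column=8):
    """Pad the first word so the rest of each line starts at `column` (1-based)."""
    aligned = []
    for line in lines:
        parts = line.split(None, 1)  # first word, then everything after it
        if len(parts) < 2:
            aligned.append(line.rstrip())  # nothing to align on this line
            continue
        first, rest = parts
        # Keep at least one space when the first word already reaches the column.
        width = max(column - 1, len(first) + 1)
        aligned.append(first.ljust(width) + rest.rstrip())
    return aligned


if __name__ == "__main__":
    for row in align_second_word(["123 ABC", "112 XYZS", "15925 asdf", "1111 25asd", "1 qwer"]):
        print(row)
```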

vim: how to replace something only in a column

I want to replace two symbols, ":" and "-", with a tab in the first column of a file:
The input is like:
chr1:100-200 1 2 3e-4
chr2:300-400 4 5 6e-4
And I want the output to be:
chr1 100 200 1 2 3e-4
chr2 300 400 4 5 6e-4
I know how to replace ":" with a tab globally using "%s/:/^I/g".
But because some of the entries have numbers in scientific notation such as 3e-4, I cannot just use "%s/-/^I/g" to replace "-" with a tab.
Does anyone know how to restrict the replacement to the first column?
Thanks.
You can press "Ctrl+V" to activate "VISUAL BLOCK" mode, select the columns to be changed, press "c", make the change, and then press "Esc"; the change will be applied across the whole selected block.
How about leaving off the g flag, so that only the first match on each line is replaced:
%s/-/^I/
If only the e is the problem, you can use %s/\([^e]\)-/\1^I/g to find -'s not preceded by e.
For the special case of the first column, you can indeed just leave off the g flag. For a general solution that works in any column, establish a blockwise visual selection with <C-v> (often <C-q> on Windows), then restrict the substitution to the visual selection with the \%V atom:
:%s/\%V-/\t/
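The "first column only" idea can also be sketched in Python for anyone scripting this instead of editing in Vim. It assumes the file is tab-delimited, which the question does not state explicitly, and the function name split_first_field is hypothetical.

```python
import re


def split_first_field(line):
    """Replace ':' and '-' with tabs, but only in the first tab-separated field."""
    parts = line.split("\t", 1)  # first field, then the untouched remainder
    parts[0] = re.sub(r"[:-]", "\t", parts[0])
    return "\t".join(parts)


if __name__ == "__main__":
    print(split_first_field("chr1:100-200\t1\t2\t3e-4"))
```

Because the remainder of the line is never passed through the substitution, values such as 3e-4 keep their minus sign.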
