This is my very first question linking to my first Python project.
To put it simply, I have two columns of data in Excel like this (first 6 rows):
destination_area | destination_code
SG37.D0 | SG37.D
SG30.C0 | SG30.C
SG4.A3.P | SG4.A
SG15.C16 | SG15.C
SG35.D02 | SG35.D
SG8.A5.BC | SG8.A
In Excel, I use a formula that gets the destination code by finding the first "." in the cell and returning all characters to the left of it, the "." itself, plus 1 more character:
=IfError(left(E2,search(".",E2)+1),"")
Now I want to do the same thing using str.extract:
df1['destination_code'] = df1['destination_area'].str.extract(r"(?=(.*[0-9][.][A-Z]))", expand = False)
print(df1['destination_area'].head(6),df1['destination_code'].head(6))
I almost got what I need, but the regex still matches too much for values that have more than one ".":
destination_area | destination_code
SG37.D0 | SG37.D
SG30.C0 | SG30.C
SG4.A3.P | SG4.A3.P
SG15.C16 | SG15.C
SG35.D02 | SG35.D
SG8.A5.BC | SG8.A5.BC
I realize that my regex matches the pattern {a number + "." + a letter}, which returns all characters for the cases of "SG4.A3.P" and "SG8.A5.BC".
So how should I modify my code? Or is there a better way to do this the way Excel does? Thanks in advance.
No need for a lookahead. Use
df1['destination_code'] = df1['destination_area'].str.extract(r"^([^.]+\..)", expand=False)
See proof. Mind the capturing group, it is enough here to return the value you need.
Explanation:
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^.]+ any character except: '.' (1 or more times (matching the most amount possible))
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
) end of \1
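As a quick check outside pandas, the same pattern can be tried with Python's standard re module. This is only a sketch: the helper name destination_code and the empty-string fallback (chosen to mirror the Excel IFERROR) are assumptions for illustration, not part of the answer above.

```python
import re

# Same pattern as in the answer: start of string, everything up to the
# first ".", the "." itself, and exactly one more character.
PATTERN = re.compile(r"^([^.]+\..)")


def destination_code(area: str) -> str:
    """Return the captured prefix, or "" when the pattern does not match.

    The "" fallback is an assumption that mirrors IFERROR(..., "") in the
    Excel formula; pandas' str.extract would give NaN instead.
    """
    match = PATTERN.match(area)
    return match.group(1) if match else ""


if __name__ == "__main__":
    for area in ["SG37.D0", "SG4.A3.P", "SG8.A5.BC", "NODOT"]:
        print(area, "->", repr(destination_code(area)))
```

Because [^.]+ cannot cross a ".", the match always stops one character after the first dot, no matter how many dots follow.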
Related
Below is a listing of some cells with unnecessary text. The text to remove would be /%%, -, and empty spaces.
Text and Result
| Text | Result |
|:--------|:---------|
| DW80R201UB/AA| DW80R201UB |
| DW80R201UW/AA| RDW80R201UW |
| DWT24PNA12| RDWT24PNA12 |
| DV-2A/XAA| RDV2A |
| 1DV-MCK/A1| RDVMCK |
| 1HAFCU1/XAA| RHAFCU1 |
| HAF-CIN/EXP| RHAFCIN |
For entries with the forward slash, I use =SUBSTITUTE(A1,RIGHT(A1,LEN(A1)-FIND("/",A1)+1),"") since there can be more than one character after the forward slash.
For everything else, I would use =SUBSTITUTE(SUBSTITUTE(A1,"-","")," ","").
I'll usually use the first formula, and then filter the column to only get #VALUE results and use the second formula. I'm just wondering if there is an easier way to get all the models with one nested function.
Take all characters to the left of a forward slash. If there's no forward slash, then take the original value. From there, substitute any dash or space with an empty string.
=SUBSTITUTE(SUBSTITUTE(IFERROR(LEFT(A1,FIND("/",A1,1)-1),A1),"-","")," ","")
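For comparison, the rule described above (keep what is left of the first slash, then drop dashes and spaces) can be sketched in Python. The function name clean_model is hypothetical, and the sketch covers only that stated rule, nothing else that may be going on in the Result column.

```python
def clean_model(text: str) -> str:
    """Keep the part before the first "/", then remove "-" and spaces."""
    # split("/", 1)[0] returns the whole string when there is no slash,
    # which plays the role of IFERROR(LEFT(...), A1) in the formula.
    head = text.split("/", 1)[0]
    return head.replace("-", "").replace(" ", "")


if __name__ == "__main__":
    for raw in ["DW80R201UB/AA", "DV-2A/XAA", "HAF-CIN/EXP", "DWT24PNA12"]:
        print(raw, "->", clean_model(raw))
```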
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1;"/";"");"-";"");" ";"");"%";"");";";"")
Change the semicolons to commas if your locale uses commas as the argument separator.
This will remove all the characters at once.
Your first formula is not working for me.
So let's say that in one row I have data in 2 cells, and I want to extract the data after the second "_" character:
| | A | B |
|---|:----------:|:---------------------:|
| 1 | 75875_QUWR | LALAHF_FHJ_75378_WZ44 | <- Input
| 2 | 75875_QUWR | 75378_WZ44 | <- Expected output
I tried using the =RIGHT() function, but then it also removes text from the first cell, and so on. How can I write this function? Maybe I could compare against the original cell and, if the result in the second row is empty because the function deleted everything, copy the value from the first row? No idea.
Try:
=MID("_"&A1,FIND("#",SUBSTITUTE("_"&A1,"_","#",LEN("_"&A1)-LEN(SUBSTITUTE("_"&A1,"_",""))-1))+1,100)
Regardless of how many times "_" appears in your string, this returns the last two "words" in it. Source
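The "last two words" idea may be easier to see in a few lines of Python. This is a sketch under the assumption that "words" means the pieces between underscores; the function name last_two_words is made up for illustration.

```python
def last_two_words(text: str, sep: str = "_") -> str:
    """Return the last two sep-separated pieces, joined by sep again."""
    parts = text.split(sep)
    # Slicing with [-2:] also works when there are fewer than two pieces,
    # in which case the input comes back unchanged.
    return sep.join(parts[-2:])


if __name__ == "__main__":
    for cell in ["75875_QUWR", "LALAHF_FHJ_75378_WZ44"]:
        print(cell, "->", last_two_words(cell))
```

A value that already has only two pieces is returned as-is, which matches the expected output for column A in the question.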
Use the following formula.
=TRIM(MID(A1,SEARCH("#",SUBSTITUTE(A1,"_","#",2))+1,100))
Given the following examples,
16A6
ECCB15
I would only like to extract the last number or numbers from the string value. So the end result that I'm looking for is:
6
15
I've been trying to find a way, but can't seem to find the correct one.
Use this formula:
=MID(A1,AGGREGATE(14,7,ROW($Z$1:INDEX($ZZ:$ZZ,LEN(A1)))/(NOT(ISNUMBER(--MID(A1,ROW($Z$1:INDEX($ZZ:$ZZ,LEN(A1))),1)))),1)+1,LEN(A1))
Try this:
=--RIGHT(A2,SUMPRODUCT(--ISNUMBER(--RIGHT(SUBSTITUTE(A2,"E",";"),ROW(INDIRECT("1:"&LEN(A2)))))))
or this (avoid using INDIRECT):
=--RIGHT(A2,SUMPRODUCT(--ISNUMBER(--RIGHT(SUBSTITUTE(A2,"E",";"),ROW($A$1:INDEX($A:$A,LEN(A2)))))))
Replace A2 in the above formula to suit your case.
Here are the data for testing:
| String |
|-----------|
| 16A6 |
| ECCB15 |
| BATT5A6 |
| 16 |
| A1B2C3E0 |
| 16E |
| TEST00004 |
I have an even shorter version: --RIGHT(A2,SUMPRODUCT(--ISNUMBER(--RIGHT(SUBSTITUTE(A2,"E",";"),ROW(INDIRECT("1:"&LEN(A2)))))))
The difference is the use of SUBSTITUTE in my final formula. I used SUBSTITUTE to replace the letter E with a symbol because, for the fifth string in the list above, the RIGHT function in my formula returns the following: {"0";"E0";"3E0";"C3E0";"2C3E0";"B2C3E0";"1B2C3E0";"A1B2C3E0"}. The third string, 3E0, is read as a number in scientific notation, so ISNUMBER returns TRUE for it, and this results in an incorrect answer. Therefore I need to get rid of the letter E first.
Let me know if you have any questions. Cheers :)
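For readers who want to cross-check the test data outside Excel, here is a minimal Python sketch that pulls the trailing run of digits with a regular expression. It swaps in a different technique from the formulas above, and the function name trailing_digits is an assumption. It returns the digits as text, so leading zeros are kept, whereas the formulas starting with -- convert the result to a number.

```python
import re


def trailing_digits(text: str) -> str:
    """Return the run of digits at the very end of text, or "" if none."""
    match = re.search(r"[0-9]+$", text)
    return match.group(0) if match else ""


if __name__ == "__main__":
    samples = ["16A6", "ECCB15", "BATT5A6", "16", "A1B2C3E0", "16E", "TEST00004"]
    for sample in samples:
        print(sample, "->", repr(trailing_digits(sample)))
```

Since the regex only looks at literal digit characters, the scientific-notation problem with the letter E does not arise here.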
I've searched for a while, but it looks like all the examples I find are the opposite of what I need. There are many ways to see if a string with wildcards matches any of the values in an array, but I need to go the other way - I need the array to contain wildcards, and check if the string in the target cell matches any of the match strings in the array, but the match strings can contain wild cards.
To put it in context, I am parsing large log files, and there are many lines I wish to ignore (but not delete); so I have a helper column:
+---+-------+----------------------------------------+----------------------------+
| | A | B | C (filter for = FALSE) | Requirement
+---+-------+----------------------------------------+----------------------------+
| 1 | 11:00 | VPN Status | =COUNTIF(IgnoreList,B1)>0 + Keep
| 2 | 11:05 | Log at event index 118, time index 115 | =COUNTIF(IgnoreList,B2)>0 + Ignore
| 3 | 11:20 | Log at event index 147, time index 208 | =COUNTIF(IgnoreList,B3)>0 + Ignore
+---+-------+----------------------------------------+----------------------------+
I've tried to put wildcards in my IgnoreList range to catch any of the "Log at event" lines:
+--------------------------------------+
| IgnoreList +
+--------------------------------------+
| State Runtime 1 +
| State Runtime 2 +
| State Runtime 3 +
| State Runtime 4 +
| Log at event index *, time index * +
+--------------------------------------+
... but this isn't working.
Does anyone know how to check a cell against an array containing wildcards?
My IgnoreList has 60 entries so far, so testing each cell individually isn't really feasible. I could have 30,000 or more entries in the log, so individual testing will be a lot more formulas than I'd hoped to use. I also don't want to edit the formulae when I add an entry to the IgnoreList.
Thanks for your help!
Use SEARCH, which allows wild card lookups, inside SUMPRODUCT:
=SUMPRODUCT(--ISNUMBER(SEARCH(IgnoreList,B1)))>0
To use COUNTIF one would need to reverse the criteria and wrap in SUMPRODUCT:
=SUMPRODUCT(COUNTIF(B1,IgnoreList))>0
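The same "one string against many wildcard patterns" check can be sketched in Python with the standard fnmatch module. This is only an illustration: IGNORE_LIST and is_ignored are hypothetical names, lower-casing both sides is an assumption made to imitate Excel's case-insensitive matching, and fnmatch also treats [ ] as special, which Excel wildcards do not.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the IgnoreList named range in the question.
IGNORE_LIST = [
    "State Runtime 1",
    "State Runtime 2",
    "State Runtime 3",
    "State Runtime 4",
    "Log at event index *, time index *",
]


def is_ignored(text: str, patterns=IGNORE_LIST) -> bool:
    """True if text matches any pattern as a whole (wildcards * and ?)."""
    lowered = text.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in patterns)


if __name__ == "__main__":
    for line in ["VPN Status", "Log at event index 118, time index 115"]:
        print(line, "->", is_ignored(line))
```

Like the COUNTIF variant, this requires the whole string to match a pattern; the SEARCH variant also accepts a pattern that matches only part of the cell.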
Sample Input File
+--------------------+---------+---------
| Name | S1 | S2
+--------------------+---------+---------
| A | -4.703 | -2.378
| B | -3283.2 | -3204.5
| C | 8779 | 7302
| D | 22078 | 18018
+--------------------+---------+---------
The S1 column needs to be removed, i.e.
Desired Output
+--------------------+---------
| Name | S2
+--------------------+---------
| A | -2.378
| B | -3204.5
| C | 7302
| D | 18018
+--------------------+---------
Can anyone help with this? Thanks.
Look, ma: no visual (block) mode!
My pragmatic approach would be: look for column anchors (-+-)
/-+-
Now, the column deletion is as simple as
d<C-v>N
(delete, block-wise, to the next occurrence of the column anchor from the end of the document).
Job done.
Fancy options
To account for multiple columns, you'd like to be precise about which column to match
This needs a little extra oomph
0f+
:exec '/\%' . col('.') . 'v\v[+|]'<Enter>
N<C-v>N
t+d
To see more about this \%22v way to select a virtual column, see
Support in vim for specific types of comments
In command mode:
:%s/^\([[+|][^+|]\+\)[+|][^+|]\+/\1/
This uses vim's built-in sed-like search and replace command. Here's the breakdown:
% - for the entire file
s - substitute: search for
/^ - the start of line
\([[+|][^+|]\+\) - followed by + or |, followed by any number (\+) of anything that is not + or |. This will get the first column, which we want to keep, so put it in a capture group by surrounding it with \( and \)
[+|][^+|]\+ - followed by + or |, followed by any number (\+) of anything that is not + or |. This will get the second column, which we don't want to keep, so no capture group.
/\1/ - replace everything we matched with the first capture group (which contains the first column). This effectively replaces the first and second column with the contents of the first column.
Like I said, vim's regexes are pretty much identical to sed's, so if you look through this tutorial on sed you'll probably pick up a lot of useful stuff for vim as well.
Edit
In response to the OP's request to make this more generally capable of deleting any column:
:%s/^\(\([[+|][^+|]\+\)\{1\}\)[+|][^+|]\+/\1/
The index inside the \{\} now selects the column to delete. Think of it like an array index (i.e. it starts at zero). So \{0\} deletes the first column, \{1\} deletes the second, and so on.
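To experiment with this substitution outside vim, the same idea can be written with Python's re module. This is a sketch, not a translation of the command character for character: it uses a non-capturing group for the repeated column and a plain [+|] class, and the function name drop_column is made up for illustration.

```python
import re


def drop_column(line: str, index: int) -> str:
    """Delete the zero-based column `index` from one table line.

    A "column" here is a + or | followed by one or more characters that
    are neither + nor |, as in the vim pattern above.
    """
    # Group 1 holds the `index` columns to keep; the column after it is
    # matched but left out of the replacement.
    pattern = r"^((?:[+|][^+|]+){%d})[+|][^+|]+" % index
    return re.sub(pattern, r"\1", line, count=1)


if __name__ == "__main__":
    table = [
        "+------+---------+---------",
        "| Name | S1      | S2",
        "| A    | -4.703  | -2.378",
    ]
    for row in table:
        print(drop_column(row, 1))
```

Changing the second argument plays the same role as changing the number inside \{\} in the vim command.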
I would like to write Mathias Schwarz's comment into an answer because Visual Mode is the natural way for the task, although there is already an accepted answer.
Assuming the cursor is at ¶:
+--------------------+¶--------+---------
| Name | S1 | S2
+--------------------+---------+---------
| A | -4.703 | -2.378
| B | -3283.2 | -3204.5
| C | 8779 | 7302
| D | 22078 | 18018
+--------------------+---------+---------
Use the normal-mode command Ctrl-V8jf+d to select the S1 column and delete it. Explanation:
Ctrl-V: enter blockwise Visual mode.
8j: eight is the number of rows in the table; this moves the cursor to the last line, keeping it in the same column.
f+: move the cursor to the next + character.
d: delete visual selection.
And result is:
+--------------------+---------
| Name | S2
+--------------------+---------
| A | -2.378
| B | -3204.5
| C | 7302
| D | 18018
+--------------------+---------
If this is the only content of the file, the simplest way is to use this:
:%normal 22|d32|
If there is more text in the file, specify the line interval:
:X,Ynormal 22|d32|
Where X and Y are the first and last lines of the interval, for example: 10,17normal 22|d32|
If you're not familiar with the normal command and the | "motion", here is a quick explanation:
The normal command executes the commands that follow it in normal mode;
The | "motion" moves the cursor to a specified column, so 22| moves the cursor to the 22nd column;
Basically, what :X,Ynormal 22|d32| does is move the cursor to the 22nd column (22|) and delete everything (d) up to the 32nd column (32|) for every line in the range X to Y.
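The effect of 22|d32| on one line can be mimicked with string slicing in Python. This is a sketch under two assumptions: the text contains no tabs or wide characters (so screen columns equal character positions), and the column before 32 is the last one deleted, because | is an exclusive motion. The function name delete_columns is hypothetical.

```python
def delete_columns(line: str, start: int, stop: int) -> str:
    """Remove 1-based columns start..stop-1, like `start|d stop|` in vim."""
    # Python indexes from 0, so column N lives at index N - 1.
    return line[: start - 1] + line[stop - 1 :]


if __name__ == "__main__":
    border = "+" + "-" * 20 + "+" + "-" * 9 + "+" + "-" * 9
    header = "|" + " Name".ljust(20) + "|" + " S1".ljust(9) + "|" + " S2"
    for row in (border, header):
        print(delete_columns(row, 22, 32))
```

Unlike the regex approaches, this depends entirely on the table being aligned to fixed columns.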
Based on patterns of your table, this can be achieved in two simple commands:
:%norm 2f+dF+
:%norm 2f|dF|
Where 2 is your column to remove (1 will remove 1st, 3 - 3rd).
This works as below (for each line at once):
Find second corresponding character of the column (2f+ or 2f|).
Delete backwards to the nearest preceding character of the column (dF+ or dF|).
Here is a command-line approach that removes the 2nd column in place:
$ ex +'%norm 2f+dF+' +'%norm 2f|dF|' -scx cols2