Related
The attached image (link: https://i.stack.imgur.com/w0pEw.png) shows a range of cells (B1:B7) from a table I imported from the web. I need a formula that allows me to extract the names from each cell. In this case, my objective is to generate the following list of names, where each name is in its own cell: Erik Karlsson, P.K. Subban, John Tavares, Matthew Tkachuk, Steven Stamkos, Dustin Brown, Shea Weber.
I have been reading about left, right, and mid functions, but I'm confused by the irregular spacing and special characters (i.e. the box with question mark beside some names).
Can anyone help me extract the names? Thanks
Assuming that your cells follow the same format, you can use a variety of text functions to get the name.
This function requires the following format:
Some initial text, followed by
2 new lines in Excel (represented by CHAR(10)
The name, which consists of a first name, a space, then a last name
A second space on the same line as the name, followed by some additional text.
With this format, you can use the following formula (assuming your data is in an Excel table, with the column of initial data named Text):
=MID([#Text],SEARCH(CHAR(10),[#Text],SEARCH(CHAR(10),[#Text])+1)+1,SEARCH(" ",MID([#Text],SEARCH(CHAR(10),[#Text],SEARCH(CHAR(10),[#Text])+1)+1,LEN([#Text])),SEARCH(" ",MID([#Text],SEARCH(CHAR(10),[#Text],SEARCH(CHAR(10),[#Text])+1)+1,LEN([#Text])))+1)-1)
To come up with this formula, we take the following steps:
First, we figure out where the name starts. We know this occurs after the 2 new lines, so we use:
=SEARCH(CHAR(10),[#Text],SEARCH(CHAR(10),[#Text])+1)+1
The inner (occurring second) SEARCH finds the first new line, and the outer (occurring first) finds the 2nd new line.
Now that we have that value, we can use it to determine the rest of the string (after the 2 new lines). Let's say that the previous formula was stored in a table column called Start of Name. The 2nd formula will then be:
=MID([#Text],[#[Start of Name]],LEN([#Text]))
Note that we're using the length of the entire text, which by definition is more than we need. However, that's not an issue, since Excel returns the smaller amount between the last argument to MID and the actual length of the text.
Once we have the text from the start of the name on, we need to calculate the position of the 2nd space (where the name ends). To do that, we need to calculate the position of the first space. This is similar to how we calculated the start of the name earlier (which starts after 2 new lines). The function we need is:
=SEARCH(" ",[#[Rest of String]],SEARCH(" ",[#[Rest of String]])+1)-1
So now, we know where the name starts (after 2 new lines), and where it ends (after the 2nd space). Assuming we have these numbers stored in columns named Start of Name and To Second Space respectively, we can use the following formula to get the name:
=MID([#Text],[#[Start of Name]],[#[To Second Space]])
This is equivalent to the first formula: The difference is that the first formula doesn't use any "helper columns".
Of course, if any cell doesn't match this format, then you'll be out of luck. Using Excel formulas to parse text can be finicky and inflexible. For example, if someone has a middle name, or someone has a initials with spaces (e.g. P.K. Subban was P. K. Subban), or there was a Jr. or something, your job would be a lot harder.
Another alternative is to use regular expressions to get the data you want. I would recommend this thorough answer as a primer. Although you still have the same issues with name formats.
Finally, there's the obligatory Falsehoods Programmers Believe About Names as a warning against assuming any kind of standardized name format.
I have the following control cards that I can't understand how to read. Could someone help me traduce what this part of a JOB is performing?
OUTFIL FNAMES=(XSCB),BLKCCT1,INCLUDE=(67,7,CH,EQ,
C'XSCB ',OR,69,7,CH,EQ,
C'XSCB '),
HEADER2=(22:C'XSCB MVS USERID SYSTEM USAGE REPORT',/,
01:C'GENERATED ON ',&DATE=(MD4/),70:C'PAGE',&PAGE,/,
01:C' AT ',&TIME,/,X,/,
01:C'JULIAN',/,
01:C'DATE TIME SYSTEM JOB MESSAGE',/,
01:C'-------- -------- ------ -------- ---------------->'),
TRAILER1=(X,/,01:C'RECORDS FOUND =',COUNT,/,34:C'END OF REPORT'),
OUTREC=(20,07,ZD,EDIT=(TTTT.TTT),X, * JULIAN DATE
28,08,X, * TIME
11,06,X, * SYSTEM
40,08,X, * JOB OR REF
59,07,CHANGE=(50,C'IEF125I',C'LOGGED ON ', * MESSAGE
C'IEF126I',C'LOGGED OFF'),
NOMATCH=(79,50),
132:X)
I understand that it searches the ID 'XSCB' in the position 67 or 69. But once it finds it, I cannot interpret what it does next.
Those are SORT control cards. If you look at the SYSOUT for the step, and pay attention to the messages, you will be able to tell if it is DFSORT (messages prefixed by ICE) or SyncSORT (messages prefixed by WER).
Your step may be EXEC PGM=SORT or ICEMAN or something else, depends on your site.
The control cards are producing a report. You have at least one line missing from your control cards (OPTION COPY, or SORT FIELDS=COPY or a different SORT or MERGE statement). There could be any number of missing cards, and you possibly have another output from the step. Otherwise the OUTFIL INCLUDE= could perhaps be a plain INCLUDE COND=.
What does what you have shown actually do?
OUTFIL defines final processing for a particular output data set. With no name, it would be for the SORTOUT DD in your JCL.
With FNAMES=(XSCB) it is for a DD named XSCB in your JCL. For a single name specified in FNAMES, the brackets are redundant.
BLKCTT1 says "put a blank in column one to not get a page-eject from TRAILER1 output".
The INCLUDE= is as you suspect. Testing two different starting positions for the same value. If either test is true, the current record will be included in the OUTFIL group.
HEADER2 defines what appears at the top of each page.
The 01: is a column-number, and is redundant, as each line by default starts are column one.
HEADER2 can create multiple lines (as can any HEADERn or TRAILERn and BUILD (or OUTREC, but don't use it for new) on OUTFIL), each separated by "/". &DATE, &TIME and &PAGE are special, containing the obvious. &DATE can be formatted in various ways, MD4/ is MM, DD, YYYY separated by slashes.
The X is a blank, on a line of its own. You could equally see .../,/... or n/ to create n multiple blank lines.
The constants should be obvious.
TRAILER1 defines what is printed at the end of the report.
COUNT is the number of records in the OUTFIL group, here used with no formatting, but it can be formatted.
The 34: column-number means the items following will start from column 34.
The OUTREC is better spelled as BUILD. OUTREC exists elsewhere. BUILD has been around for more than 10 years, so no need to use OUTREC on OUTFIL in new code (maybe this is old anyway).
What the BUILD would do is format the current input record into what is desired for an output line on the report.
The numbers in pairs are start-position and length of fields. Where no field-type is defined, they are (treated as) character fields.
You have one field-type, ZD, which is zoned-decimal. Its length is seven, and an EDIT mask is used, four digits, full-stop (decimal-point) and then three digits.
The Xs as previously are blanks, used as separators on the report. The content of each field is described in a comment. A comment is any text after the end of a control card. A control card ends at the blank after the statement is complete, or where a there is a blank after a possible continuation (a comma or a colon are possible continuations).
132:X puts a blank in column 132, and pads any intervening columns from the last field or constant with blanks.
That leaves the CHANGE=.
CHANGE= is a very useful test-and-replace.
79,50,CHANGE=(50,C'IEF125I',C'LOGGED ON ', * MESSAGE
C'IEF126I',C'LOGGED OFF'),
NOMATCH=(79,50)
This says "at the current column of the record being created, consider the content of the input from position 79 for a length of 50. The output length will be 50. If IEF125I, then use the constant LOGGED ON, if IEF126I use LOGGED OFF, and else (NOMATCH) use whatever is at position 79 for a length of 50 from the input.
Basically, the report is using the system log, or an extract from it, to report activity related to the Userid/Logon XSCB.
I am making a function in python 3.5.2 to read chemical structures (e.g. CaBr2) and then gives a list with the names of the elements and their coefficients.
The general rundown of how I am doing it is i have a for loop, it skips the first letter. Then it will append the previous element when it reaches one of: capital letter/number/the end. I did this with index of my iteration, and then get the entry with index(iteration)-1 or -2 depending on the specifics. For the given example it would skip C, read a but do nothing, reach B and append to my name list the translation of Ca, and append 1 to my coefficient list.
This works perfectly for structures with unique entries, but with something like CaCl2, the index of the iteration at the second C is not 2, but zero as index doesn't differentiate between the two. How would I be able to have variables in my function equal to the value at previous index(es) without running in to this problem? Keeping in mind inputs can be of any length, capitalization cannot change, and there could be any number of repeated values
I have many Import files, which look like this
So there are sales values per Team Member, but NO period inside.
The period is coded in the Path like:
AllData\201501\Revenues.txt
AllData\201502\Revenues.txt
AllData\201503\Revenues.txt
I want to have the Periode from the path on each data row, so my final output table should look like this:
So I must bring the period from the path inside the file anyway.
The question how to access the path is solved in perfect example here:
How can I save a path criteria when I import from folders?
But there I have still the period on the "whole" text, not on the row.
In the linked question you can change the custom column formula from:
Text.FromBinary([Content])
to
Text.Split(Text.FromBinary([Content]), "#(000a)")
(depending on how line breaks are represented, you may need to use "#(000a)#(000d)" instead).
This will split the text at each new line, and you'll get a list of the name;value pairs. Click on the box with the two arrows next to the column name to expand the column. Each row should now have the period associated with the name;value pair. Finally, split the column by delimiter on the semicolon to separate the name from the value.
There are 2 options, both involve horrible looking equations.
First option, we assume the paths are going to have the period in the same position in the string.
for the example, we want the number between the 1st and 2nd slashes.
=TRIM(LEFT(SUBSTITUTE(MID(A1,FIND("|",SUBSTITUTE(A1,"\","|",1))+1,LEN(A1)),"\",REPT(" ",LEN(A1))),LEN(A1)))
If it's between a different set of slashes, alter the ,1 to tell the formula which slash to start from. If the number of slashes can be different, then we will have to try for the second option.
Second option, we assume that those are the only numbers in the path.
This formula will extract those numbers:
=SUMPRODUCT(MID(0&A1,LARGE(INDEX(ISNUMBER(--MID(A1,ROW($1:$25),1))* ROW($1:$25),0),ROW($1:$25))+1,1)*10^ROW($1:$25)/10)
Note that this will extract all the numbers from the string. If the path contains numbers, then these will get added to the string. e.g. C:\2014Data\201401\Revenues.txt would return 2014201401
If this doesn't take care of it, then it may be easier putting a column into the table yourself
I have a list of images stored in a directory. They are all named. My GUI reads all the images and saves their names in a cell array. Now I have added a editable box that the user can type in a name and the program will show that image. The problem is I want the program to take into account typos and misspellings by the user and find the most similar file name to the user typed word. Can you please help me?
Many Thanks,
Hamid
You should read this WP article: Approximate string matching and look at "Calculation of distance between strings" on FEx.
I think you should use the longest common subsequence algorithm to approximately compare strings.
Here is a matlab implementation:
http://www.mathworks.com/matlabcentral/fileexchange/24559-longest-common-subsequence
After, just do something like that:
[~,ind]=min(cellarray( #(x) LCS(lower(userInput),lower(x)), allFileNames));
chosenFile=allFileName{ind};
(the function LCS is the longest common subsequence algorithm, and the functionlower converts to lower case)
Not exactly what you are looking for, but you can compare the first few characters of the strings ignoring case to find a close match. See the command strncmpi:
strncmpi Compare first N characters of strings ignoring case.
TF = strncmpi(S,C,N) performs a case-insensitive comparison between the
first N characters of string S and the first N characters in each element
of cell array C. Input S is a character vector (or 1-by-1 cell array), and
input C is a cell array of strings. The function returns TF, a logical
array that is the same size as C and contains logical 1 (true) for those
elements of C that are a match, except for letter case, and logical 0
(false) for those elements that are not. The order of the two input
arguments is not important.