Trimming strings in excel for different words - excel

Scenario: I have some rows of string data in Excel. The data is always in the same order ("columns"), but the size of the data in each "column" varies. In the original strings there can be one or more blank spaces between each piece of "column" data, and so far I have used the TRIM function to reduce that to one blank space.
Objective: I am trying to somehow separate the data from the string in different columns, but inside each column data, there might also be spaces, for example I am trying to output this original:
James Smith code1 code2 10.5 09/23/1900AT PRESENT UUUB SJ SPECIAL 250AAA No No NoCORRECTED part1
to this with trim:
James Smith code1 code2 10.5 09/23/1900AT PRESENT UUUB SJ SPECIAL 250AAA No No NoCORRECTED part1
as this:
James Smith code1 code2 10.5 09/23/1900 AT PRESENT UUUB SJ SPECIAL 250AAA No No No CORRECTED part1
where each field is in its proper column.
Obs1: One of the problematic fields for me is the one that has the result "AT PRESENT", because there is a space in between, and there is no space between the "AT" and the last digit of the previous column.
Obs2: I also face similar problems in the first row (headers), which can also have more than one word per field.
Obs3: Here are two other string examples that appear in the dataset:
code1 03/15/1950TEAM-ALPHA h/s/s CERTIFIED3-3/1 third point 03/19/1944 -- --SR SR Prototype
code1 200000.00especial reduced Redone third part -- No
What I already tried: I have been trying the LEFT, RIGHT and MID functions, but since I cannot foresee how many letters will be in most of the fields, I found no proper way to do it. I also tried simple character substitution, but that does not solve the problem of the fields that are mistakenly merged. The first thing I tried was "Text to Columns", but the result is also problematic: if there are spaces inside a field, the field gets divided, and if there is no space between two fields, they end up in the same column. I am trying to do this as dynamically as possible, to account for different data variants.
Question: Any suggestions or ideas on how to tackle this situation?

Have you tried Text to Columns on Data tab?
Set the original data type to "Delimited" and select the "Space" delimiter. Make sure you tick "Treat consecutive delimiters as one".
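Text to Columns on its own cannot separate fields that are fused together (Obs1 in the question), so a preprocessing pass may help. The following is a minimal Python sketch of that idea, not something from the original thread. It assumes dates are always MM/DD/YYYY and that a lowercase letter followed directly by two or more capitals marks a missing space; the function name unfuse and both patterns are hypothetical.

```python
import re

# Assumed patterns (not from the original post):
# a date is MM/DD/YYYY, and a lowercase letter followed by two or more
# capitals means two fields were fused together.
DATE_FUSED = re.compile(r"(\d{2}/\d{2}/\d{4})(?=\S)")
CASE_FUSED = re.compile(r"(?<=[a-z])(?=[A-Z]{2,})")

def unfuse(row):
    """Collapse whitespace runs, then insert a space at likely field boundaries."""
    row = " ".join(row.split())        # similar effect to Excel's TRIM
    row = DATE_FUSED.sub(r"\1 ", row)  # 09/23/1900AT -> 09/23/1900 AT
    row = CASE_FUSED.sub(" ", row)     # NoCORRECTED -> No CORRECTED
    return row

print(unfuse("James Smith code1 code2 10.5 09/23/1900AT PRESENT No NoCORRECTED part1"))
```

Other fusions in the sample data, such as 200000.00especial or --SR, are not covered and would each need their own rule. Multi-word fields like "AT PRESENT" still contain a space afterwards, so the space delimiter alone will not rebuild the columns.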

Related

Extracting text from complex string in excel

The attached image (link: https://i.stack.imgur.com/w0pEw.png) shows a range of cells (B1:B7) from a table I imported from the web. I need a formula that allows me to extract the names from each cell. In this case, my objective is to generate the following list of names, where each name is in its own cell: Erik Karlsson, P.K. Subban, John Tavares, Matthew Tkachuk, Steven Stamkos, Dustin Brown, Shea Weber.
I have been reading about left, right, and mid functions, but I'm confused by the irregular spacing and special characters (i.e. the box with question mark beside some names).
Can anyone help me extract the names? Thanks
Assuming that your cells follow the same format, you can use a variety of text functions to get the name.
This function requires the following format:
Some initial text, followed by
2 new lines in Excel (each represented by CHAR(10))
The name, which consists of a first name, a space, then a last name
A second space on the same line as the name, followed by some additional text.
With this format, you can use the following formula (assuming your data is in an Excel table, with the column of initial data named Text):
=MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,SEARCH(" ",MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,LEN([@Text])),SEARCH(" ",MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,LEN([@Text])))+1)-1)
To come up with this formula, we take the following steps:
First, we figure out where the name starts. We know this occurs after the 2 new lines, so we use:
=SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1
The inner (occurring second) SEARCH finds the first new line, and the outer (occurring first) finds the 2nd new line.
Now that we have that value, we can use it to determine the rest of the string (after the 2 new lines). Let's say that the previous formula was stored in a table column called Start of Name. The 2nd formula will then be:
=MID([@Text],[@[Start of Name]],LEN([@Text]))
Note that we're using the length of the entire text, which by definition is more than we need. However, that's not an issue, since Excel returns the smaller amount between the last argument to MID and the actual length of the text.
Once we have the text from the start of the name on, we need to calculate the position of the 2nd space (where the name ends). To do that, we need to calculate the position of the first space. This is similar to how we calculated the start of the name earlier (which starts after 2 new lines). The function we need is:
=SEARCH(" ",[@[Rest of String]],SEARCH(" ",[@[Rest of String]])+1)-1
So now we know where the name starts (after the 2 new lines) and how long it is (everything up to, but not including, the 2nd space). Assuming we have these numbers stored in columns named Start of Name and To Second Space respectively, we can use the following formula to get the name:
=MID([@Text],[@[Start of Name]],[@[To Second Space]])
This is equivalent to the first formula: The difference is that the first formula doesn't use any "helper columns".
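The helper-column steps above can be mirrored outside Excel to check the logic. This is a rough Python sketch under the same format assumptions (two line breaks, then a first name, a space, a last name, and a second space). The function name extract_name and the sample text are made up for illustration, since the original cell contents are only available as an image.

```python
def extract_name(text):
    """Return the text between the 2nd line break and the 2nd following space."""
    first_nl = text.index("\n")                 # inner SEARCH: first new line
    second_nl = text.index("\n", first_nl + 1)  # outer SEARCH: second new line
    rest = text[second_nl + 1:]                 # "Rest of String" helper column
    first_space = rest.index(" ")
    second_space = rest.index(" ", first_space + 1)
    return rest[:second_space]                  # stop just before the 2nd space

# Hypothetical cell contents following the assumed layout.
sample = "1\nsome header\nErik Karlsson OTT - D"
print(extract_name(sample))
```

Like the formula, this raises an error (ValueError here, #VALUE! in Excel) as soon as a cell lacks the second line break or the second space.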
Of course, if any cell doesn't match this format, then you'll be out of luck. Using Excel formulas to parse text can be finicky and inflexible. For example, if someone has a middle name, or has initials separated by spaces (e.g. P.K. Subban written as P. K. Subban), or there is a Jr. or similar suffix, your job will be a lot harder.
Another alternative is to use regular expressions to get the data you want. I would recommend this thorough answer as a primer. Although you still have the same issues with name formats.
Finally, there's the obligatory Falsehoods Programmers Believe About Names as a warning against assuming any kind of standardized name format.

Vlookup Not working on text between two tables

This is not your average vlookup error.
I have two Power Query tables that I've setup. One is coming from a CSV file with a list of names. The other is from a website pulling a list of names.
i.e.
=John Smith = John Smith would not be true for some reason.
The VLOOKUP should be able to find the name easily. I've tried PROPER, UPPER, CLEAN, trimming, Text to Columns and everything else I could think of. I've changed data types, to no avail.
I know that one query is causing the issue. I can type the name exactly and do a vlookup from one, and it works. The second query that I do this to doesn't return anything on the typed text.
Anyone encounter this issue while using Power Query?
EDIT: See Jeeped's Answer - When I replace the space from the web query with a normal space it works.
@Jeeped's comment has a good answer:
Assuming you have already trimmed off leading and trailing spaces, one of the John Smith entries (likely the one from the web) uses a non-breaking space (i.e. CHAR(160) or ASCII 0xA0) instead of a regular space (i.e. CHAR(32) or ASCII 0x20). Use
=CODE(MID(A$1, ROW(1:1), 1))
on both, fill down to get an ASCII code for each letter, and compare the numbers.
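The same character-by-character comparison can be sketched in Python. This is only an illustration of the diagnosis described in the comment; the helper names char_codes and normalize_spaces are hypothetical.

```python
def char_codes(text):
    """List the code point of every character, like CODE(MID(...)) filled down."""
    return [ord(ch) for ch in text]

def normalize_spaces(text):
    """Replace non-breaking spaces (code 160) with regular spaces (code 32)."""
    return text.replace("\u00a0", " ").strip()

from_csv = "John Smith"
from_web = "John\u00a0Smith"   # looks identical on screen

print(char_codes(from_csv))
print(char_codes(from_web))
print(normalize_spaces(from_web) == from_csv)
```

In a worksheet, the equivalent replacement is usually written as =SUBSTITUTE(A1,CHAR(160)," "), which matches the fix reported in the question's edit.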

separate Last Name, First Name and Middle Initial in three different columns

I have a file which contains Last Name, First Name MI for about 5000 people.
I need to split them in 3 different columns.
The issue I am facing is that sometimes there is more than one first name; for example, I have a person listed as Davis, Mary Ann L.
I want Davis in one column.
Mary Ann in another column and L in the 3rd column. Basically check if after the comma the number of characters is greater than 1. If it is greater than 1 then consider it as first name. If number of characters is equal to 1, then consider it Middle Initial.
How can I achieve this?
In your case, I would start with the "Text to Columns" command. Just mark the whole column, then choose Data -> Text to Columns. Choose "Delimited", then Next, then select "Space".
After this, I would look through the processed data to get a picture of it. I assume that most records will already be fine. The records that are exceptions to the standard should be easily identifiable. You could even filter for them.
Only then, in a third step, I'd write a formula which processes the columns you have created in the first step.
Or, possibly a formula is not necessary at all. Possibly you can just easily filter and process some of the exceptions manually.
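For comparison, the rule described in the question (a trailing single-character token after the comma is the middle initial, everything else is the first name) can be sketched in Python. This is an assumption-laden illustration rather than the answerer's method; the function name split_name is hypothetical.

```python
def split_name(full_name):
    """Split 'Last, First [More] [M.]' into (last, first, middle initial)."""
    last, _, rest = full_name.partition(",")
    tokens = rest.split()
    middle = ""
    # Assumption: only a trailing token of exactly one letter (with or
    # without a period) counts as a middle initial, and only if at least
    # one other token is left over for the first name.
    if len(tokens) > 1 and len(tokens[-1].rstrip(".")) == 1:
        middle = tokens.pop().rstrip(".")
    return last.strip(), " ".join(tokens), middle

print(split_name("Davis, Mary Ann L."))
print(split_name("Smith, John"))
```

Records without a comma end up entirely in the first element, which makes them easy to filter out and handle manually.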

Mid Function for Microsoft Excel to obtain column .txt file

Captain Morgan ------ Insane Journeys -------- A-
I have easily gotten the left and right side parts using Left() and Right() functions.
I want to use a function in Excel (not VBA) that will allow me to get the middle phrase in this sentence (the dashes are really excessive spaces). Can I accomplish this with a Mid() function?
This is just 1 item on a list of 80 different things in 1 column that needs to be turned into 3 columns. Every item has different character lengths. So the length counts cannot be manually entered.
I agree with Text to Columns but the image in the other answer only has one space per row while OP has some spaces that are redundant and some that are not. For this I’d suggest a modified approach:
Replace all pairs of spaces with a character unlikely to be encountered – I’d suggest a pipe.
Apply Text to Columns with pipe as delimiter.
Apply TRIM to the middle column to remove any remaining redundant spaces (eg =TRIM(B1) copied down and then that column pasted as values over the source).
But to answer can I accomplish this with a Mid() function? I think yes though not cost effective for a mere 80 entries when there is a viable alternative.
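The three steps above (replace pairs of spaces with a pipe, split on the pipe, trim each piece) can be checked with a short Python sketch. The function name split_on_space_runs is hypothetical, and dropping empty pieces is an assumption that plays the role of "Treat consecutive delimiters as one".

```python
def split_on_space_runs(line):
    """Split on runs of two or more spaces while keeping single spaces intact."""
    piped = line.replace("  ", "|")              # step 1: pairs of spaces -> pipe
    pieces = piped.split("|")                    # step 2: pipe as delimiter
    trimmed = [piece.strip() for piece in pieces]  # step 3: like TRIM
    return [piece for piece in trimmed if piece]   # drop empty pieces

line = "Captain Morgan      Insane Journeys        A-"
print(split_on_space_runs(line))
```

A run with an odd number of spaces leaves one stray space behind after step 1, which is why the trim in step 3 is still needed.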
Try to use "Text to columns" from Data Tab. It has option to split data to different columns using various criteria.
All you need to do is select data you want to split to columns and select criteria you need.
In your case it can be either Space or Other:. When you select Other: you can add your own criteria like "space dot space" or anything you need.
For more detailed information you can enter this link.

Sort same data into columns

EDIT: Thanks for all the responses, everyone. I'm going to go ahead and try to write rules to cover as many of the cases as I can, and either manually extract or try to write more rules to cover everything else.
I am trying to sort the same "types" of data into the same columns. Essentially, I get a data dump where a bunch of data (year, company name, person name, IO number, PO number, project description, and a bunch of comments) dumps into one single column, like this:
The ideal end result would be sorting so that same type of data in the same columns, i.e. all years in column A, all IOs in column B, all POs in column C, all person names in column D, all company names in column E, and whatever is left is dumped into a "comments" section in column F.
I've written a macro that employs the SUBSTITUTE function so that it goes through this string and substitutes all dashes and backslashes with commas, then separates based on the comma delimiter, then re-pastes the text as plain text. This works fairly well, except in the occasional case where there are dashes in the name of a company, or a backslash to indicate two people who own that IO/PO, or when all of the data is entered without any delimiter, such as: 2012 Company project title IO ##### PO #### Person Name.
So here is what I am asking:
1. Is there a better way to parse the data than what I am doing now? How can I accommodate exceptions such as a dash in the company name, or a string where there are no dashes or backslashes, only spaces?
2. Once I have parsed all of this data and separated it into separate columns like so:
how do I sort it so that the same type of information is in the same column?
Any help would be greatly appreciated. Please let me know if anything was unclear.
Welcome to StackOverflow!
If the text follows clear rules, like a separator as "-" or "," you can use the Split() function to get an array of tokens. If the text doesn't follow any rule it's impossible. Very likely you are in the middle, where most of the texts follow the rules. For the other texts, you need to massage your code and try to find new rules and check them with... see below.
Create a few functions IsYear(), IsPO(), IsCompany() that return True if the content is recognized. The functions could be as simple as IsYear = Text Like "20##" or could contain many tests. Then you make a function that checks each cell of each row, and sorts if required.
I'm sorry I can't give you anything more than some generic advice, but this is a very open question for a very challenging problem.
I hope this gets you started.
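As a rough illustration of the recognizer idea (IsYear(), IsPO() and so on), here is a Python sketch rather than VBA. The patterns are assumptions based on the "IO #####" and "PO ####" shapes mentioned in the question, the sample row is invented, and the names is_year, is_io, is_po and classify are hypothetical.

```python
import re

def is_year(token):
    return re.fullmatch(r"20\d\d", token) is not None   # like Text Like "20##"

def is_io(token):
    return re.fullmatch(r"IO\s*\d+", token, re.IGNORECASE) is not None

def is_po(token):
    return re.fullmatch(r"PO\s*\d+", token, re.IGNORECASE) is not None

CHECKS = (("year", is_year), ("io", is_io), ("po", is_po))

def classify(tokens):
    """Put each recognized token in its column; everything else is a comment."""
    row = {"year": "", "io": "", "po": "", "comments": []}
    for token in tokens:
        token = token.strip()
        if not token:
            continue
        for column, check in CHECKS:
            if check(token) and not row[column]:
                row[column] = token
                break
        else:
            row["comments"].append(token)   # no recognizer claimed this token
    return row

# Invented sample row, split on " - " the way Split() would be used in VBA.
print(classify("2012 - Acme Corp - IO 12345 - PO 6789 - rush job".split(" - ")))
```

Recognizers for company and person names are deliberately left out, since those are the hard cases and would need rules specific to the actual data.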
Along the lines of @Werner's “You can’t make a silk purse …”: obviously the solution is to lean on whoever is responsible for the garbage in, to ensure that your source data is in better shape. However, I guess you are looking for a workaround. From your example, some ‘tidying’ is possible. E.g. sort on ColumnB and, where 2012 is in ColumnC, exchange the contents of B and C for that row. Then sort on ColumnD and do much the same for D and E. If ColumnF contains Quote, insert a blank cell and shift to the right. If ColumnF is blank, exchange the contents of that row with ColumnD. Move ColumnD to the end. Select anything before Quote in ColumnF and move it to ColumnE if that is empty, otherwise to ColumnH. The result should look something like:
-rather better than I was expecting and I’d guess about the limit of what could reasonably be programmed.
