I have a lot of Excel data spread across 32 documents in an identical format that list information. The combined total would be over 6 million rows.
I have another document that contains a few thousand rows. The CONCATENATE of columns C, E and L in this new document could be the same as the CONCATENATE of columns D, F and N in any of the other 32 documents.
I want to find the information that is the same and grab the whole lot of it, for each row in the small document, from each of the larger documents.
At the moment this requires that I concatenate the info on each of the larger documents, remove all spaces and punctuation, and use 32 IFERROR calculations, each containing a VLOOKUP. The last one took all night. All the others have crashed the computer.
There must be a better way of doing this?
E.g.
Small document:
TITLE1 | TITLE2 | TITLE3
Larger documents (all 32)
FACT1 | FACT2 | FACT3 | TITLE1 | TITLE2 | FACT4 | TITLE3
If the concatenation of Title 1, 2 and 3 from the small document matches the concatenation of Title 1, 2 and 3 (with all spaces and punctuation removed) in any of the larger documents, I want to copy all the information for that row from the larger document, including the titles and facts, next to the matching row in the smaller document.
Ai yai yai. Excel is not made for doing something like this. It's simply the wrong tool. So assuming that you are stuck with that, I would try creating an Access database, linking each spreadsheet you need, then writing a query. I'm not totally clear on what you want to do with the matched info, but you could export it to a new spreadsheet, or link a spreadsheet to that query.
In Access (2007+), go to the External Data tab, click on Excel in the Import section, then select Link. If that is still too slow, you will need to copy the spreadsheets in and perform the query.
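To give a rough idea of what the database approach looks like, here is a minimal sketch using Python's built-in sqlite3 module rather than Access (a swapped-in tool, not what this answer describes). The table names, column names and sample rows are all hypothetical. The make_key helper mirrors the "remove all spaces and punctuation" step from the question; upper-casing the key is an extra assumption so that matching ignores case, as VLOOKUP does.

```python
import re
import sqlite3

def make_key(*parts):
    """Join the parts, keep only letters and digits, and upper-case the result."""
    return re.sub(r"[^0-9A-Za-z]", "", "".join(parts)).upper()

# Hypothetical sample data standing in for the real spreadsheets.
small_rows = [("Blue Widget", "A-1", "North"), ("Red Gear", "B2", "South")]
large_rows = [
    ("f1", "f2", "f3", "blue widget", "A1", "f4", "north"),
    ("g1", "g2", "g3", "Green Cog", "C3", "g4", "East"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE small (key TEXT, title1 TEXT, title2 TEXT, title3 TEXT)")
conn.execute(
    "CREATE TABLE large (key TEXT, fact1 TEXT, fact2 TEXT, fact3 TEXT,"
    " title1 TEXT, title2 TEXT, fact4 TEXT, title3 TEXT)"
)
conn.executemany(
    "INSERT INTO small VALUES (?, ?, ?, ?)",
    [(make_key(*row), *row) for row in small_rows],
)
# In the large layout the titles sit in positions 3, 4 and 6.
conn.executemany(
    "INSERT INTO large VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [(make_key(r[3], r[4], r[6]), *r) for r in large_rows],
)
# An index on the key lets the join avoid scanning the large table per row.
conn.execute("CREATE INDEX idx_large_key ON large (key)")

matches = conn.execute(
    "SELECT s.title1, s.title2, s.title3, l.fact1, l.fact2, l.fact3, l.fact4"
    " FROM small AS s JOIN large AS l ON l.key = s.key"
).fetchall()
print(matches)
```

With all 32 sheets loaded into the one large table, a single join replaces the 32 nested lookups.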
Would be much better in a database rather than Excel, but to make it work efficiently in Excel you need to use the binary search option (sorted approximate match) of VLOOKUP or MATCH. This is several orders of magnitude faster than linear (unsorted) search:
1. Create additional columns doing the concatenation etc on the 32 sheets and the small sheet.
2. Sort the data on the 32 sheets using the concatenated column
3. Use a Double VLOOKUP with IF to turn the Approximate match into an exact match, something like this
=IF(VLOOKUP(PartNumber,PartsList,1,TRUE)=PartNumber, VLOOKUP(PartNumber,PartsList,4,TRUE), "Missing")
See http://fastexcel.wordpress.com/2012/03/29/vlookup-tricks-why-2-vlookups-are-better-than-1-vlookup/
for a more detailed explanation of this formula.
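For anyone who wants to see the idea outside Excel, here is a minimal Python sketch of the same trick using the standard bisect module: a binary search finds the last entry that is less than or equal to the key, and an equality check turns that approximate match into an exact one. The parts data and the sorted_lookup name are made up for illustration, and Python's string ordering is not guaranteed to match Excel's sort order.

```python
from bisect import bisect_right

def sorted_lookup(key, sorted_keys, values, missing="Missing"):
    """Binary search for the last entry <= key, then confirm it equals key."""
    i = bisect_right(sorted_keys, key) - 1   # plays the role of VLOOKUP(..., TRUE)
    if i >= 0 and sorted_keys[i] == key:     # plays the role of the IF(... = PartNumber) check
        return values[i]
    return missing

# Hypothetical parts list, sorted by key as step 2 requires.
parts = sorted([("P300", "bolt"), ("P100", "nut"), ("P200", "washer")])
keys = [k for k, _ in parts]
vals = [v for _, v in parts]

print(sorted_lookup("P200", keys, vals))  # exact key present
print(sorted_lookup("P250", keys, vals))  # falls between two keys, so reported as missing
```

Each lookup costs roughly log2(n) comparisons instead of up to n, which is where the speed-up over an unsorted search comes from.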
If there is a structure so you know in which sheet you need to search, you could see if the INDIRECT function can help you. Otherwise, I suggest you import the data in a database (such as Access) and then query your data.
Related
I have some text which I receive daily that I need to separate. I have hundreds of lines similar to the extract below:
COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): COMPANY A OFFERS 1000KB AT $0.40
I need to extract individual snippets from this text, each into a separate cell. The result needs to be the date, month, company, size, and price. In this case, the result would be:
FEB50-40
APR
COMPANY A
100
0.40
The issue I'm struggling with is uniformity. For example, one line might have FEB50-FEB40, another FEB5-FEB40, or FEB50-FEB4. Another example giving me difficulty is that some rows might have 'COMPANY A' and others 'COMPANYA' (one word instead of two).
Any ideas? I've been trying combinations of the below but I'm not able to have uniform results.
=TRIM(MID(SUBSTITUTE($D7," ",REPT(" ",LEN($D7))), (5)*LEN($D7)+1,LEN($D7)))
=MID($D7,20,21-10)
=TRIM(RIGHT(SUBSTITUTE($D6,"$",REPT("$",2)),4))
Sometimes I get
'FEB40-50(' or 'FEB40-FEB5'
when it should be
'FEB40-FEB50'
Thank you to whoever is able to help.
You might get to the limits of formulas with this scenario, but with Power Query you can still work.
As I see it, you want to apply the following logic to extract text from this string:
COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): COMPANY A OFFERS 1000KB AT $0.40
text after the first : and before the first (
text between the brackets
text after the word OFFERS and before AT
text after AT
These can be easily translated into several "Split" scenarios inside Power Query.
Split by the custom delimiter ": " (that's colon and space) for each occurrence
Remove the first column
Split the new first column by " (" (that's space and bracket) at the leftmost occurrence
Replace ")" with nothing in the second column
Split the third column by the delimiter OFFERS
Split the new fourth column by the delimiter AT
The screenshot shows the input data and the result in the Power Query editor after renaming the columns and before loading the query into the worksheet.
Once you have loaded the query, you can add / remove data in the input table and simply refresh the query to get your results. No formulas, just clicking ribbon commands.
You can take this further by removing the "KB" from the column, converting it to a number, and dividing it by 100. Your business processing logic will drive what you want to do. Just take it one step at a time.
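The same split-by-delimiter logic can also be sketched in plain Python, for anyone who would rather script it than use Power Query. This is only an illustration of the steps listed above: the parse_offer name and the dictionary keys are made up, stripping the "$" is an extra step not in the list, and it assumes every line follows the same "label: spread (month): company OFFERS size AT price" layout.

```python
def parse_offer(line):
    """Split one line into spread, month, company, size and price."""
    # Split on colon + space and drop the first column (the label).
    _, spread_month, offer = line.split(": ", 2)
    # Split on space + opening bracket, then remove the closing bracket.
    spread, month = spread_month.split(" (", 1)
    month = month.replace(")", "")
    # Split on the word OFFERS, then on the word AT.
    company, rest = offer.split(" OFFERS ", 1)
    size, price = rest.split(" AT ", 1)
    return {
        "spread": spread,
        "month": month,
        "company": company,
        "size": size,
        "price": price.lstrip("$"),
    }

sample = ("COMMODITY PRICE DIFFERENTIAL: FEB50-FEB40 (APR): "
          "COMPANY A OFFERS 1000KB AT $0.40")
print(parse_offer(sample))
```

Because the splits key on delimiters rather than character positions, variations such as FEB5-FEB40 or COMPANYA do not shift the results.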
I have a series of data (in 2-dimensional list 'CombinedTable') I need to use to populate a table in an MS Word template. The table has 7 columns so I attempted the following using docxtpl module:
context = {
    'tpl_modules1': CombinedTable[0],
    'tpl_modules2': CombinedTable[2],
    'tpl_modules3': CombinedTable[4],
    'tpl_modules4': CombinedTable[6],
    'tpl_modules5': CombinedTable[8],
    'tpl_modules6': CombinedTable[10],
    'tpl_modules7': CombinedTable[12],
}
tpl.render(context)
tpl.save(FilePath + FileName)
Not the most elegant solution, I know, but I am just trying to get this working. Unfortunately, using this code with the following template results in the tpl_modules7 data being written into all columns, rather than just the 7th.
Does anyone have advice for how to resolve this? I attempted to create a for loop through the columns as well as rows but was unsuccessful in writing anything to the doc (was saved as a blank & empty doc).
The CombinedTable variable is a list of 12 lists (one for each column in template, although only 7 contain data). Each of these 12 lists contains another list with cell data whose length is equal to the number of rows to be written to the table in that column. This means that the number of rows that are written to varies for each column.
EDIT: Looking more closely at the docs, it states that I cannot use %tr multiple times in the same row. I assume I will then have to use a loop through %tc and %tr (which I tried & couldn't get working). Any advice on how to implement this? Especially on the side of the word document. Thanks!
I was able to resolve this satisfactorily for my requirements, although my solution may not suit everyone. I simply set up 7 different tables in a document with 7 columns and adjusted the margins and borders to get the table dimensions I required. Each of the 7 tables had the same docxtpl syntax as in the image in my question, with the small buffer columns between them replaced by columns in the Word document.
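One alternative that was not tried in this thread (so treat it as an untested sketch) is to reshape the per-column lists into per-row lists, so that a single row loop can fill the whole table. The columns data and the tpl_rows name below are made up for illustration. The template would then need one table row wrapped in a {%tr for row in tpl_rows %} ... {%tr endfor %} loop, with {{ row[0] }} to {{ row[6] }} in its cells, which is an assumption about how the template would be laid out.

```python
from itertools import zip_longest

# Hypothetical stand-in for the populated columns of CombinedTable;
# the columns deliberately have different lengths.
columns = [
    ["a1", "a2", "a3"],
    ["b1"],
    ["c1", "c2"],
]

# Transpose columns into rows, padding the short columns with empty strings.
rows = [list(cells) for cells in zip_longest(*columns, fillvalue="")]

# 'tpl_rows' is a made-up template variable name.
context = {"tpl_rows": rows}
print(rows)
```

Padding with empty strings keeps every row the same width, so columns with fewer entries simply leave blank cells at the bottom.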
I'm trying to get the rows of an Excel document. What I have achieved so far:
1-. Retrieve .xls, .xlsx files
2-. Convert those files to TIFF images
3-. Enhance image for better text recognition
4-. Identify Pages
5-. Create the Documents
6-. Recognize Page and Fields
7-. Populate Fields (this is where my problem is)
For example, in a table like
Name | Age | Size
Juan | 26 | 1.90m
Max | 25 | 1.85m
Victor | 26 | 1.65m
My project can find the keywords Name, Age and Size, and in the settings I can tell it that the value is one line down and to group the leading and trailing words. But it only fills the fields Name, Age and Size with the first values below and ignores the others, and Datacap does not seem to have a field array type.
In the image, you can see that there is only one way to add fields, and they are scalar (just one value). "Add multiple" only adds multiple fields at once, not a field of multiple values haha.
This is how my fields get retrieved
Another problem I face is that my Excel sheet gets split up to fit a document format. I was expecting the whole sheet to be converted into 1 document, not 4.
In the image, those 4 pages are from the same sheet (in the Excel file).
The IBM docs still lack information; there are some pages that have only a title and zero information lol.
Agreed on point 1: it does not support an array-like field, which is a more advanced feature. This feature is really needed, and we may see something from IBM going forward.
Coming back to the second point, Datacap converts the Excel file according to its print pages, the same way it would be split when you print it. You have to add a ruleset to merge those pages into a single file. The most common way to do that is to use tiffmerge, which Datacap provides out of the box.
Situation:
I am pulling information from a database and exporting it into an Excel 2010 template. The data consists of unique IDs (numeric), dates, and text in their respective columns. When going to sort, Excel usually recognizes the unique IDs as text and gives me the option of 'A-Z' which yields the correct result.
Problem:
Occasionally when sorting the unique IDs, Excel will give me the option to sort from 'Smallest to Largest' and when this happens the report yields a wildly incorrect result.
Pattern:
The sorting criteria is the only common denominator when a report fails, which makes little sense as they are both ascending orders. This issue only occurs ~20% of the time. The other times it sorts correctly from 'A-Z' as it does in the other worksheets within the same template.
-I've tried changing Number Format within the drop down to 'Text' 'General' and 'Numbers'
-I've tried manually sorting the data through filters as opposed to sort hierarchies
-I've tried clearing the table, and re-copying/pasting the data into the template's worksheet. This seems to work, but as the end goal is automation, I'd like to find out what the root cause is.
Expected result: Numeric data copied and pasted into the field to be sorted from 'A-Z', resulting in a successful report.
Actual result: Numeric data copied and pasted into the field typically results in the sort option of 'A-Z', but occasionally sorts from 'Smallest to Largest', resulting in a failed report.
Excel is designed for numbers - and is generally very helpful in coercing text to numbers where appropriate. However, once in Number format the reverse is not easy. As you have discovered, merely choosing Text as format is not enough.
A clue is whether or not the cells show green triangles (assuming error checking is activated).
Other than starting afresh with data entry into a cell already formatted as Text, the conventional solution for conversion with code is to prepend a quote, though appending a space would also serve.
Other than that, the easiest mass conversion approach may be to copy into Word (Keep text only) and copy back to Excel with pasting as Text.
The better solution may be to store IDs as text and prepend 0s to a standard length.
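To illustrate why the two sort options give such different results, and why zero-padding helps, here is a small Python sketch. The IDs and the padding width of 5 are made up, and Python's string comparison is only a rough stand-in for Excel's A-Z text sort.

```python
ids = ["10", "9", "100", "25"]

# Compared character by character, roughly like an A-Z sort on text.
as_text = sorted(ids)

# Compared by numeric value, like Smallest to Largest.
as_numbers = sorted(ids, key=int)

# Zero-padded to a fixed width (5 is an arbitrary choice for this example).
padded = sorted(i.zfill(5) for i in ids)

print(as_text)
print(as_numbers)
print(padded)
```

Once every ID has the same length, the text order and the numeric order agree, so it no longer matters which sort option Excel offers.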
I have a large data set with 4 columns of interest all containing text, namely pokemon moves. The columns "move 1" through to "Move 4" each contain a different move, and each row differs in the combination.
eg.
  | A         | B      | C        | D          | E
1 | Pokemon   | Move 1 | Move 2   | Move 3     | Move 4
2 | Igglybuff | Tackle | Tailwhip | Sing       | Attract
3 | Wooper    | Growl  | Tackle   | Rain Dance | Dig
~ 1000 more
My issue is this:
I wish to filter this data set for rows (pokemon) containing a certain combination of moves from a list.
eg. I want to find which pokemon have both "Growl" and "Tackle". These moves can appear in any of Moves 1 to 4 (aka order of the moves is unimportant)
How would I go about filtering for such a result? I have similar situations in which I would want to search for a combination of 3 or 4 moves, the specific order of which is not important, or search for specific pokemon possessing a specific combination of moves.
I've attempted to use functions such as COUNTIF, to no avail.
Help / Ideas are much appreciated
There are a number of options for advanced filtering in Excel that you might consider:
Option 1 - Advanced Filters
Advanced filters give you the power to query over multiple criteria (which is what you need). You can also easily do it as many times as you want to generate the final datasets using each filter. Here is a link to the advanced filter section for Microsoft Excel 2010, which is virtually identical here to 2007. It would be a great place to start if you want to move outside of just using basic formulas.
If you do go down this route, then follow the directions on the site in terms of steps:
Insert the various criteria that you have selected in the top rows of your spreadsheet and specify those rows as the criteria range
Set the list range to the area holding all your data on a single worksheet
Run the filter and look at the resulting data. You can easily do a count on the number of records in that reduced data set.
Option 2 - Pivot Tables
Another option that you might look at here would be to use Pivot tables. Pivot tables and pivot charts are just phenomenal tools that I use in the workplace every day to accomplish exactly what you are looking for.
Option 3 - Using Visual Basic
As a third option, you could try writing a solution in Visual Basic code. This would give you precise control, as you could specify exactly which ranges to look at for each of the conditions. Unfortunately, you would need to understand VB code in order to use this solution. There are some excellent online resources available that can help with this.
=COUNT(INDEX(MATCH(B2:E2, MoveList, 0), 0)) > 0
will return TRUE if any of the values in the range B2:E2 (Moves 1 through 4) are in the named range MoveList. Using a named range means you can easily copy this formula down for all of your thousand rows.
If you remove the last part that checks whether the COUNT() value is greater than zero, you get:
=COUNT(INDEX(MATCH(B2:E2, MoveList, 0), 0))
which will return the number of moves that the Pokemon has that match a move on the move list.
MATCH() takes three arguments: a lookup value, the lookup range, and the match type. I don't fully understand why, but wrapping that part of the formula in INDEX() seems to let you use an array for the first argument. Maybe someone here can provide a better explanation.
In any case, the formulae above do appear to solve the problem.
Finally, if you're only checking for a few moves, instead of using a confusing formula and a named range as above, you could just make a column for each move that you want to check for, e.g. "Has Growl?" and "Has Tackle?". You would then just use =COUNTIF(B2:E2, "Tackle") and =COUNTIF(B2:E2, "Growl"). You could then make another column that sums these columns and filter out the zero values to display only Pokemon who have Tackle or Growl.
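For comparison, the same logic is short to express outside Excel. Here is a minimal Python sketch using the two sample rows from the question: count_matches plays the role of the COUNT/MATCH formula, and has_all checks for the "both Growl and Tackle" case the question asks about. The function and variable names are made up for illustration.

```python
# Sample rows from the question: pokemon name -> Moves 1 to 4.
pokemon_moves = {
    "Igglybuff": ["Tackle", "Tailwhip", "Sing", "Attract"],
    "Wooper": ["Growl", "Tackle", "Rain Dance", "Dig"],
}

def count_matches(moves, move_list):
    """Number of this pokemon's moves that appear on the move list."""
    return sum(1 for move in moves if move in move_list)

def has_all(moves, wanted):
    """True if every wanted move is among the pokemon's moves, in any order."""
    return set(wanted) <= set(moves)

wanted = ["Growl", "Tackle"]
counts = {name: count_matches(moves, wanted) for name, moves in pokemon_moves.items()}
with_all = [name for name, moves in pokemon_moves.items() if has_all(moves, wanted)]

print(counts)
print(with_all)
```

In spreadsheet terms, filtering for a count equal to the number of wanted moves gives the "has all of them" result, while a count greater than zero gives "has any of them".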
I looked at these two pages when researching how to accomplish this:
https://www.excelforum.com/excel-general/786407-find-if-any-value-on-one-list-exists-on-another.html
https://www.deskbright.com/excel/using-index-match/