Get unstructured data from excel sheets into 1 new file

I'm looking for pointers toward a solution for an interesting request I received recently.
I was given an Excel file with one sheet per day, so the file for a year has 365 sheets, each named after its date.
The interesting and also annoying part is that each sheet contains roughly 15 tables, which are not formatted as Excel tables but only laid out visually to look like tables. See the example here:
The desired format is this:
TABLE NAME -- NAME -- VALUE1 -- VALUE2 -- VALUE3 -- SHEETNAME
Luckily the source layout is the same on every sheet. My question: does anyone know a good method to create a new Excel file that takes all this data and combines it into one sheet? Which software or language would suit this?
So essentially it would be saying: use cell/row X5 and cell/row Y4 as column 1, cell/row xxx as column 2, and so on (from all available sheets, combined).
Then I'd want to import that data and have it transformed and loaded into one big new table as described. I previously used pandas dataframe and tabular to merge PDF tables into one, but those were already actual tables and thus easier. These are basically just cells, visually arranged as tables, which makes this quite a nightmare.
I would highly appreciate any creative ideas.
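One way to sketch the "use cell X as column 1, cell Y as column 2" idea in Python: describe each visual table once as a fixed layout, then apply the same layout to every sheet. This is only a minimal sketch under assumptions. The `TableLayout` name, the coordinates, and the sample data are hypothetical, and each sheet is represented as a plain list of rows so the example runs without Excel. In practice the grids could come from `pandas.read_excel(path, sheet_name=None, header=None)`, which returns one DataFrame per sheet; empty cells then arrive as NaN and would need converting to None first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableLayout:
    """Position of one visual table (hypothetical layout, 0-based indices)."""
    title_cell: tuple   # (row, col) of the cell holding the table name
    first_row: int      # first data row of the table
    last_row: int       # last data row, inclusive
    name_col: int       # column holding NAME
    value_cols: tuple   # columns holding VALUE1..VALUE3


def cell(grid, row, col):
    """Return grid[row][col], or None when the sheet is shorter or narrower."""
    if row < len(grid) and col < len(grid[row]):
        return grid[row][col]
    return None


def flatten_workbook(sheets, layouts):
    """Turn {sheet_name: grid} into rows of
    [TABLE NAME, NAME, VALUE1, VALUE2, VALUE3, SHEETNAME]."""
    out = []
    for sheet_name, grid in sheets.items():
        for layout in layouts:
            table_name = cell(grid, *layout.title_cell)
            for r in range(layout.first_row, layout.last_row + 1):
                name = cell(grid, r, layout.name_col)
                if name in (None, ""):
                    continue  # skip empty slots inside the visual table
                values = [cell(grid, r, c) for c in layout.value_cols]
                out.append([table_name, name, *values, sheet_name])
    return out


# Hypothetical sheet: a title in A1, a header row, and two entries below it.
demo_sheet = [
    ["Fruit", None, None, None],
    ["Name", "V1", "V2", "V3"],
    ["Apple", 1, 2, 3],
    ["Pear", 4, 5, 6],
]
layouts = [TableLayout(title_cell=(0, 0), first_row=2, last_row=3,
                       name_col=0, value_cols=(1, 2, 3))]
rows = flatten_workbook({"2019-01-01": demo_sheet, "2019-01-02": demo_sheet},
                        layouts)
```

Because the layout is identical on every sheet, the 15 tables would need 15 `TableLayout` entries written once; the resulting rows can then be written to a single new sheet or CSV.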

Related

Excel design consideration - infrequent change and impact for VBA/formula/add-on

Nothing is stuck or broken; I was just inspired by a discussion with another Excel author.
His situation:
He reads from an existing Excel monster file (column FG) and has hard-coded the following:
Range("FF:FG").Copy
Potential issue:
The data in FF:FG will be pushed to GF:GG every couple of months because new columns are inserted in between. (It's a pivot-like design. Sorry, but end users need this appearance; the categories keep increasing, and the summary needs to stay at the right-hand end.)
He has two other choices (if he doesn't want to maintain the VBA code every few months):
A: Store "FF","FG" in a cell (at a fixed location!), then read that location parameter from VBA.
B: Read a second, dedicated CSV file (copied and pasted from the monster file; it is consumed by another user, so it is readily available) that contains only the two required columns.
To me, neither is obviously better than the other; it is just a matter of preference.
A similar but simpler scenario of mine
I produce the monstrous file with lots of VLOOKUPs from manual data sources (I inherited the design, and I have refactored it using another automation tool, but there are licensing considerations at the moment).
One column has a formula doing something like
=IF(A1="SALES PERSON SICK","void result",IF(A1="MACHINE BROKEN",C2*0.8,""))
in, say, 5000 rows.
To reduce hard-coding I moved
"SALES PERSON SICK" and "MACHINE BROKEN" to cells A1 and A2 of a reference sheet, and changed the formula to:
=IF(A1=Ref!$A$1,"void result",IF(A1=Ref!$A$2,C2*0.8,""))
I feel this is good practice.
Question: Is method A or B better? Considering that the column position will move every ~3-6 months, is it still worth choosing one of A/B?

data in FF:FG will be pushed to GF:GG every couple of months because newer columns will be inserted in between
Then you should use named ranges in what you call the "monster file" (see Define and use names in formulas) and refer to them in your VBA.
E.g. define a name such as CopySource for columns FF:FG (use a name that describes the data in those columns), and then use it in your VBA code:
Range("CopySource").Copy
Whenever the range moves because new columns are inserted before it, the named range moves too, so it still points to the same data.

Is there any other way to parse the Excel file with irregular tables?

I used to use pandas to parse Excel files, and it worked pretty well when the data followed a table format. But recently I got new data that looks like this:
When I use pandas to read the Excel file, it reads the entire spreadsheet instead of the individual tables (one per week). My idea now is to reorganize the tables.
For example, when I read column B from row 10 to row 25 and encounter the value "% Rejection", I move right to read the percentage for each day (seven values) and build the new table I want.
However, this doesn't feel very efficient, so I'm curious whether there is another way to parse the data. Any recommendations would be great. Thank you.
Edit:
I wonder if I can parse the Excel file into a table that looks like this:
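The scan described in the question (walk down column B, and on each "% Rejection" label read the seven cells to its right) can be sketched as a small function. This is a sketch under assumptions: the sheet is represented as a plain list of rows, the function name and the sample data are hypothetical, and the default row range simply mirrors the rows 10 to 25 mentioned above.

```python
def extract_labelled_rows(grid, label, label_col=1, first_row=9, last_row=24,
                          width=7):
    """Find rows whose label column equals `label` and return a list of
    (1-based row number, values) pairs, where values are the next `width`
    cells to the right of the label.

    Defaults mirror the description: column B (index 1), rows 10-25
    (indices 9-24), seven daily values. Parameters are 0-based indices."""
    found = []
    for r in range(first_row, min(last_row, len(grid) - 1) + 1):
        row = grid[r]
        if label_col < len(row) and row[label_col] == label:
            start = label_col + 1
            found.append((r + 1, row[start:start + width]))
    return found


# Hypothetical sheet fragment with two weekly blocks.
demo = [
    [None, "Week 1", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    [None, "% Rejection", 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    [None, "Week 2", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    [None, "% Rejection", 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
]
weekly = extract_labelled_rows(demo, "% Rejection", first_row=0, last_row=3)
```

A single pass over a few dozen rows per sheet is cheap, so the approach is usually less inefficient than it feels; the cost is mostly in keeping the row range and label in sync with the layout.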

Excel - If + Index Match + Offset -- VBA or something else?

I made a dummy version (fake names, extremely shortened) of the two spreadsheets I'm working with.
Background:
I'm automating the data flow between Contracts and our Accounting Team's template. Note that neither of these spreadsheets' formats can budge, which is why I'm stuck. It's a clunky process that I am trying to automate. The main source of data is the "Contracts" tab. Let's say that, out of the 300 subcontractor projects, my coworker approved 130 in the week of 1/24/2019. The logic of what I am trying to accomplish:
In the Contracts tab, if Column R is "Yes"--
In the "Accounting Template" tab (the one with formulas) Column B, pull all the cells of Contracts!A of the vendors we are set to pay.
The same applies to Template! (a nickname for the path) Column M: pull the specific Contract IDs of the approved contracts from Contracts!C.
Note that I intentionally showed that my fake Puppies program is NOT approved to get its payment; this will help demonstrate how to resolve my issue.
My key issue is that the Accounting Template has a project only on every third row, while the Contracts tab has project data on every single row. So for Template!A5 I am pulling data from Contracts!A2, for Template!A8 I am pulling data from Contracts!A3, etc.
I was able to (sort of) make this work with OFFSET, ROWS and INDEX/MATCH:
=OFFSET(INDEX(Contracts!$C$2:$C$167,MATCH(ROWS(Contracts!$A$2:A17),Contracts!$AB$2:$AB$167,0)),-10,0)
See that -10? For each new third row in the template, I am manually changing it to -10, -12, -14, and so on. Not exactly sophisticated.
Looking at how OFFSET and ROWS work, they seem to rely heavily on the coordinates of cells in the Contracts workbook. Ideally, however, I am looking to do this:
=IF(Contracts!R2="Yes",OFFSET(INDEX(Contracts!$C$2:$C$167,MATCH(ROWS(Contracts!$A$2:A5),Contracts!$AB$2:$AB$167,0)),-2,0))
However, once I throw a conditional (IF) into the mix, it reorients the rows of my OFFSET/MATCH. Are there better formulas for what I am trying to accomplish? Or a VBA script that could accomplish this IF, INDEX, MATCH, OFFSET, ROW dream of mine? I'm not married to either of these formulas.
I've perused a few VBA scripts, but none seems to have a conditional like IF as a component.
EDIT:
Per a request, adding screenshots. There's also a Google Sheet link:
Contracts tab, purposely hiding irrelevant columns:
Accounting Template Tab:

I'd do it with VBA, but this formula example might help you get started, if your issue is basically turning horizontal data into vertical data at a fixed interval of 3 rows. You will need to adapt the formulas to your actual setup.
The formulas used are:
F1 and down =IF(MOD(ROW()-1,3)=0,INDEX($A$1:$A$3,(ROW()+2)/3),"")
G1 and down =INDEX($B$1:$D$3,CEILING(ROW(),3)/3,1+MOD(ROWS($G$1:G1)-1,3))
I'm sure there are better ways ...
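The index arithmetic behind those two formulas may be easier to follow outside Excel. The Python sketch below reproduces it: output row r reads source row ceil(r/3) and source column 1 + (r-1) mod 3, and the label is shown only on the first row of each group of three. The function names and sample values are hypothetical; `labels` stands in for A1:A3 and `block` for B1:D3.

```python
import math


def source_position(out_row, group=3):
    """Map a 1-based output row to the 1-based (row, column) of the source
    block, mirroring CEILING(ROW(),3)/3 and 1+MOD(ROWS($G$1:G1)-1,3)."""
    src_row = math.ceil(out_row / group)
    src_col = 1 + (out_row - 1) % group
    return src_row, src_col


def unpivot(labels, block, group=3):
    """Return (label, value) pairs, one per output row. The label is filled
    in only on the first row of each group, like column F in the answer."""
    out = []
    for out_row in range(1, len(block) * group + 1):
        src_row, src_col = source_position(out_row, group)
        label = labels[src_row - 1] if (out_row - 1) % group == 0 else ""
        out.append((label, block[src_row - 1][src_col - 1]))
    return out


labels = ["A", "B", "C"]                      # stands in for A1:A3
block = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]     # stands in for B1:D3
pairs = unpivot(labels, block)
```

The same mapping run in reverse (template row to contract row) is what removes the need to hand-edit an offset for every third row.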

Managing data sets in SPSS where multiple cases appear in one row

I'm working with a data set which has details on multiple people in one row. I've dealt with this by having variables like this:
P1Name P1Age P1Gender P1Ethnicity P2Name P2Age P2Gender... etc
This makes analysis very difficult. I have used multiple response variables, which are good for frequencies, but it's unwieldy, it takes time to write out the syntax (there are a lot of 'P's), and you can't do other analyses with it.
First of all, is there a way to run analyses as if all the name, age, gender and so on variables were in the same columns, one row per person (if that makes sense)? All I can think of doing is pasting the data into Excel, cutting and pasting to get them all into the same columns, and then pasting back into SPSS. Any other ideas?
Or is this just a matter of having two datasets, one for the case details and one for the people details?
Any advice would be greatly appreciated!

Write the data out using SAVE TRANSLATE and then read it back in removing the P's, like this:
FILE HANDLE MyFile /NAME='/Users/rick/tmp/test.csv'.
DATA LIST FREE /p1x1 p1x2 p1x3 p2x1 p2x2 p2x3 p3x1 p3x2 p3x3.
BEGIN DATA.
1 2 3 1 2 3 1 2 3
END DATA.
LIST.
SAVE TRANSLATE
/OUTFILE=MyFile
/TYPE=CSV /ENCODING='UTF8' /REPLACE
/CELLS=VALUES.
DATA LIST FREE FILE=MyFile /x1 x2 x3.
LIST.
That should do it.
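For comparison, the same wide-to-long restructuring (one row per person instead of one row per case) can be sketched in Python. This is not the SPSS route shown above, just an illustration of the target shape under assumptions: the `CaseID` field, the function name, and the sample values are hypothetical, and variable names are assumed to follow the P1Name, P1Age, P2Name pattern from the question.

```python
import re

# "P2Age" -> person number "2", field name "Age"
PERSON_FIELD = re.compile(r"^P(\d+)(\w+)$")


def wide_to_long(case, id_field="CaseID"):
    """Split one wide case (a dict) into one dict per person."""
    people = {}
    for key, value in case.items():
        match = PERSON_FIELD.match(key)
        if not match:
            continue  # case-level variable, not a person variable
        number, field = int(match.group(1)), match.group(2)
        people.setdefault(number, {})[field] = value

    rows = []
    for number in sorted(people):
        fields = people[number]
        if all(v in (None, "") for v in fields.values()):
            continue  # this person slot is empty for the case
        rows.append({id_field: case.get(id_field), "Person": number, **fields})
    return rows


case = {"CaseID": 7,
        "P1Name": "Ann", "P1Age": 34,
        "P2Name": "Bob", "P2Age": 29,
        "P3Name": "", "P3Age": None}
long_rows = wide_to_long(case)
```

Keeping the case identifier on every person row is what allows the two-dataset arrangement mentioned in the question: case details in one file, people in another, joined on that identifier.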

Excel Lookup with multiple queries

I have a question that I may not be thinking about correctly. I have a long Excel file that I pull from somewhere else, with the following columns:
Project_Name1, Employee_Name1, Date_Worked1, Hours_Worked1
In another sheet I have these columns
Project_Name2, Employee_Name2, Begin_Date2, End_Date2, Hours_Worked2
This second sheet is filled with data, and works just fine.
However, it turns out that there are some employees whose names I do not know who are also working on the same project. I need to figure out the names of those employees and then sum the number of hours they worked in a given period.
So I need a lookup with three criteria:
Project_Name1 = Project_Name2
Employee_Name1 <> {Array of Employee_Name2}
Begin_Date2 <= Date_Worked1 <= End_Date2
Returning Employee name.
Once I have the employee name, I can do a SUMIFS() and get the total hours they worked, no problem.
I have tried a number of combinations of INDEX/MATCH functions, using Ctrl+Shift+Enter, and have not been able to figure it out. Any help would be greatly appreciated.

What you're talking about doing is quite complicated and a little bit past what Excel was designed to do by default. However, there are a few workarounds that you can use to attempt to get the information that you're looking for.
It's possible to do multiple-criteria VLOOKUPs and SUMIFs by concatenating fields to make a multi-part identifier (e.g. insert a new column with a formula in it like =A1&B1).
Open a new workbook and use Microsoft Query. (I'm not sure if you can select from more than one sheet, but if you can select from multiple sheets like tables, you should be able to write a semi-complex query to pull the dataset you want.)
http://office.microsoft.com/en-us/excel-help/use-microsoft-query-to-retrieve-external-data-HA010099664.aspx
Use the embedded macro feature and write out your business logic in Visual Basic. (The hotkey for the editor is ALT+F11.)

One way to do this would be to first add a helper column to the right of the entries on the sheet you're trying to pull employee_name from, containing =ROW()
You could then use an array formula, like the one you were trying to implement, to return the row number of the corresponding match:
{=SUM((project_name1=projectname2)*(employeename1<>employeename2)*(begindate<=date_worked1)*(date_worked1<=end_date2)*(match_column))}
You could then use the returned match_column value as the row argument of INDEX, as you described, to retrieve the appropriate entries.
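If scripting outside Excel is an option, the three criteria can also be expressed directly. The Python sketch below finds employees who logged time on a project within a listed period but are not among the known employees for that project, and totals their hours. It is a sketch under assumptions: the function name and sample rows are hypothetical, rows are plain tuples in the column order given in the question, and a period is assumed to include both its begin and end date.

```python
from collections import defaultdict
from datetime import date


def unknown_employee_hours(timesheet, assignments):
    """timesheet: (project, employee, date_worked, hours) rows (first sheet).
    assignments: (project, employee, begin, end, hours) rows (second sheet).

    Returns {(project, begin, end, employee): total_hours} for employees who
    worked on a project inside a listed period but are not listed for that
    project on the second sheet."""
    known = defaultdict(set)     # project -> known employee names
    periods = defaultdict(set)   # project -> {(begin, end), ...}
    for project, employee, begin, end, _hours in assignments:
        known[project].add(employee)
        periods[project].add((begin, end))

    totals = defaultdict(float)
    for project, employee, worked, hours in timesheet:
        if employee in known[project]:
            continue  # criterion 2: only names missing from the second sheet
        for begin, end in periods[project]:
            if begin <= worked <= end:  # criterion 3: date inside the period
                totals[(project, begin, end, employee)] += hours
    return dict(totals)


timesheet = [
    ("Alpha", "Ann", date(2019, 1, 5), 8.0),
    ("Alpha", "Zed", date(2019, 1, 6), 4.0),
    ("Alpha", "Zed", date(2019, 1, 7), 3.5),
    ("Alpha", "Zed", date(2019, 2, 1), 2.0),   # outside the period
    ("Beta", "Zed", date(2019, 1, 6), 1.0),    # project not on second sheet
]
assignments = [("Alpha", "Ann", date(2019, 1, 1), date(2019, 1, 31), 40.0)]
unknown = unknown_employee_hours(timesheet, assignments)
```

This returns the name and the summed hours in one step, so no separate lookup followed by a SUMIFS is needed.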
