I have the following table
Address, City, Data1, Data2, Data3
123 North 5th Street, San Francisco, A, B, C
123 N 5th Street, San Francisco, [Blank], D, [Blank]
123 North 5th St, San Francisco, E, F, G
I want to merge the data based on 2 criteria: the first 4 digits of the address and the city.
So the merge row would look like:
123 North 5th Street, San Francisco, AE, BDF, CG
I have about 6000 records, including the "duplicates". I have the table in both Access and Excel; any help would be appreciated.
Are you sure you want to carry out the matching on those criteria? Would you, for instance, want to match the following record with the three above:
123 North 4th Street, San Francisco?
The ideal way to do this is to standardise the addresses first and then deduplicate the data. In NZ, for instance, we use a PAF (Postal Address File) to standardise the addressing and accurately issue a DPID (delivery point identifier) to each record. You'd then be in a position to match the data and carry out your second step of merging records (which is still a tricky exercise). There should be plenty of vendors in the US who can do this for you for a small fee; Acxiom, I believe, is a global player in this space.
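To illustrate why standardising first helps, here is a minimal Python sketch. The abbreviation map and the `normalise_address` helper are hypothetical and far cruder than a real postal address file; they only show that the three sample rows collapse to one key while the "4th Street" record stays separate.

```python
import re

# Hypothetical abbreviation map; a real address file covers far more cases.
ABBREVIATIONS = {
    "n": "north", "s": "south", "e": "east", "w": "west",
    "st": "street", "ave": "avenue", "rd": "road",
}


def normalise_address(address):
    """Lower-case the address, drop punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


addresses = [
    "123 North 5th Street",
    "123 N 5th Street",
    "123 North 5th St",
    "123 North 4th Street",
]
keys = [normalise_address(a) for a in addresses]
```

Matching on the normalised key keeps "5th" and "4th" apart, which a match on the first four characters alone would not.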
If you don't want to do that, another option is to use a third-party tool that matches those records with some fuzzy logic, rather than coding it yourself. I've previously used an Excel add-in from a company called DQGlobal to run over data and match records.
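If you do decide to code it yourself, the merge described in the question can be sketched in Python as follows. This is only an illustration under the question's own rule (first 4 characters of the address plus the city); `merge_rows` and `prefix_len` are made-up names, and blanks are represented as empty strings.

```python
rows = [
    ("123 North 5th Street", "San Francisco", "A", "B", "C"),
    ("123 N 5th Street", "San Francisco", "", "D", ""),
    ("123 North 5th St", "San Francisco", "E", "F", "G"),
]


def merge_rows(rows, prefix_len=4):
    """Group rows by (address prefix, city) and concatenate their data fields."""
    merged = {}
    for address, city, *data in rows:
        key = (address[:prefix_len], city)
        if key not in merged:
            # The first row seen for a key supplies the address that is kept.
            merged[key] = [address, city] + list(data)
        else:
            existing = merged[key]
            for i, value in enumerate(data, start=2):
                existing[i] += value
    return [tuple(row) for row in merged.values()]


merged = merge_rows(rows)
```

Note that this key would also merge "123 North 4th Street, San Francisco" into the same row, which is exactly the risk raised above.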
Related
I have huge sets of data with addresses that run over multiple lines. I'm trying to create a grouping function or tool in Excel or Alteryx that can transpose and group all of the data into a tabular format without selecting each group individually. The problem is that some of the raw data spans multiple rows, which makes a function difficult to create.
e.g.
name: belgium holdings
address: 123 european st
paris, France
taxid: 12345
Final result
name address taxid
belgium holdings 123 european st 12345
paris, France
I need this result for almost 50 entries, and ideally I'd also like to consolidate all the tabs into one spreadsheet.
I've attached an example below; the right-hand side is the final result.
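One possible way to handle values that spill onto extra lines is to treat any line without a known "field:" prefix as a continuation of the previous field. The Python sketch below assumes every record starts with a `name:` line and that the only fields are `name`, `address` and `taxid`; `parse_records` is a hypothetical helper, not an Excel or Alteryx feature.

```python
FIELDS = ("name", "address", "taxid")


def parse_records(lines):
    """Turn 'field: value' lines into one dict per record, joining continuation lines."""
    records = []
    current = None
    last_key = None
    for line in lines:
        text = line.strip()
        if not text:
            continue
        key, sep, value = text.partition(":")
        key = key.strip().lower()
        if sep and key in FIELDS:
            if key == "name" or current is None:
                # A 'name:' line is assumed to open a new record.
                current = {}
                records.append(current)
            current[key] = value.strip()
            last_key = key
        elif current is not None and last_key is not None:
            # No known field prefix: append to the previous field's value.
            current[last_key] += " " + text
    return records


raw = [
    "name: belgium holdings",
    "address: 123 european st",
    "paris, France",
    "taxid: 12345",
]
records = parse_records(raw)
```

Each dict then maps directly onto one row of the name / address / taxid table.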
I have a spreadsheet that looks like this
State, City
WA, Seattle
WA, Seattle
WA, Yakama
OR, Portland
OR, Albany
NY, Albany
OR, Portland
I want to count the duplicates, but only where BOTH columns have the same value. I would like the output to give me this info:
State, City, Count
WA, Seattle, 3
WA, Yakama, 1
OR, Portland, 2
OR, Albany, 1
NY, Albany, 1
I know this should be simple but I am having trouble finding this exact question elsewhere... thanks
You have a couple options.
Solution 1: Formulas
First copy and paste your state and city to new columns, then dedupe them using the Data tab. Then here's the formula for cell F3:
=COUNTIFS(A:A,D3,B:B,E3)
Solution 2: Pivot Table
Create a pivot off your data. Rows would be State and City. Values is Count of Whatever (city for example). Change your design to Tabular, repeat all labels, do not show grand or subtotals.
Just for fun, a Microsoft 365 solution (assuming you made a typo in your sample data):
=CHOOSE({1,2,3},UNIQUE(A2:B8),INDEX(UNIQUE(A2:B8),0,2),COUNTIFS(A2:A8,INDEX(UNIQUE(A2:B8),0,1),B2:B8,INDEX(UNIQUE(A2:B8),0,2)))
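For comparison, the same "count rows where both columns match" logic can be sketched outside Excel in a few lines of Python. The data is the sample from the question as posted, so WA/Seattle comes out as 2 rather than the 3 shown in the desired output.

```python
from collections import Counter

rows = [
    ("WA", "Seattle"),
    ("WA", "Seattle"),
    ("WA", "Yakama"),
    ("OR", "Portland"),
    ("OR", "Albany"),
    ("NY", "Albany"),
    ("OR", "Portland"),
]

# Each (state, city) tuple is one key, so both columns must match to be counted together.
counts = Counter(rows)

for (state, city), count in counts.items():
    print(state, city, count)
```

Counting tuples is the equivalent of the two-criteria COUNTIFS: OR/Albany and NY/Albany stay separate because the state differs.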
I have a rota listing names (down the side) and dates (across the top) and there is also an additional column for region/area for each name e.g.:
--Name--------Region-------13/12/19--------14/12/19--------15/12/19--------16/12/19-----------17/12/19
John Smith North IN IN OFF IN OFF
Jane Doe North OFF IN IN IN OFF
Bob Newhart South IN IN OFF OFF OFF
I also have a list of jobs completed by each person e.g.:
--Name--------Region-----Job#---CompletedDate-----JobType
John Smith 22 14/12/19 xx
John Smith 23 14/12/19 yy
John Smith 24 16/12/19 zz
Bob Newhart 25 14/12/19 aa
I know how to look up the Region from the name =INDEX(table[Region],MATCH(A2,table[Name],0),0)
and I've even worked out how to look up whether they're in or off based on a 2-way INDEX/MATCH (e.g. =INDEX(Rota!B:B,MATCH(A2,Rota!A:A,0),MATCH(D2,Rota!1:1,0)))
My issue is when a person changes region e.g. John Smith moves from North to West:
--Name--------Region-------13/12/19--------14/12/19--------15/12/19--------16/12/19-----------17/12/19
John Smith North IN IN OFF
John Smith West IN OFF
Jane Doe North OFF IN IN IN OFF
Bob Newhart South IN IN OFF OFF OFF
I need to look up what their region was on the day the job was done so I can summarise the jobs done per region separately from the number of jobs done per person.
I'm guessing that it is something like the 2nd INDEX/MATCH above but with some form of "if the cell is blank then move down to the next match of their name in the same column". Does that even make sense?
Any help would be appreciated.
Thanks,
Alan
What about creating a helper column that combines person + region? e.g. you'll have a (hidden) column containing values like John Smith~north (you'll want a separator that properly handles Jane South).
Additionally, I think the first index function you've created can be substituted by a VLOOKUP:
=VLOOKUP(A2,rota!$A:$E, 2, FALSE)
// look for the value in A2 in the first column of A:E on the Rota sheet, then copy over the value from the 2nd column; FALSE means don't approximate matches
The second could be re-created with a simple COLUMN() reference to itself if the dates are entirely congruent between the sheets (i.e. if the dates are the same for each column).
EDIT
Took a second (closer) look - You're kind of pushing Excel to the limits of what it is designed for. You could try and hack around this, e.g. by wrapping the second INDEX function in an IF that checks if the outcome of the index is not empty and if it is, moves the match down one. But this will not tackle an example where John moves between three regions.
Conceptually, you're working on three tables:
1. Name/date with in or out values
2. Name/date with region values
3. Job/date with name values
Because your ROTA table is merging 1 and 2, you are getting multiple rows for one individual, and that is giving you trouble in the third table. Maybe in your design you can split these, which will make it easier to grab them with the functions you use. This starts to look a lot like a database though...
But maybe you add the region in the fields of the ROTA table?
--Name----------13/12/19--------14/12/19--------15/12/19--------16/12/19-----------17/12/19
John Smith IN~N IN~N OFF~N IN~S OFF~S
This means that for the region you can use the function that you've already used, provided you append a single character in all cases:
=LEFT(INDEX(table[Region],MATCH(A2,table[Name],0),0), LEN(INDEX(table[Region], MATCH(A2,table[Name],0),0)) -2)
=RIGHT(INDEX(table[Region], MATCH(A2,table[Name],0),0), 1)
Or if you want to use the delimiter (~ in the example):
=LEFT(INDEX(table[Region],MATCH(A2,table[Name],0),0), FIND("~", INDEX(table[Region],MATCH(A2,table[Name],0),0)) -1)
=RIGHT(INDEX(table[Region],MATCH(A2,table[Name],0),0), LEN(INDEX(table[Region],MATCH(A2,table[Name],0),0)) - FIND("~", INDEX(table[Region],MATCH(A2,table[Name],0),0)))
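The idea of storing status and region together in each rota cell can be sketched in Python to show how the region on the day of a job falls out of a single lookup. The dictionary layout and the `status_and_region` helper are assumptions made for illustration; the data is the John Smith example row and his three jobs from the question.

```python
from collections import Counter

# Rota keyed by name, then by date; each cell holds "status~region".
rota = {
    "John Smith": {
        "13/12/19": "IN~N",
        "14/12/19": "IN~N",
        "15/12/19": "OFF~N",
        "16/12/19": "IN~S",
        "17/12/19": "OFF~S",
    },
}

# (job number, name, completed date)
jobs = [
    (22, "John Smith", "14/12/19"),
    (23, "John Smith", "14/12/19"),
    (24, "John Smith", "16/12/19"),
]


def status_and_region(name, date):
    """Split the combined rota cell for this person and date at the delimiter."""
    status, _, region = rota[name][date].partition("~")
    return status, region


# Count jobs by the region the person was in on the completion date.
jobs_per_region = Counter(status_and_region(name, date)[1] for _, name, date in jobs)
```

Because the region travels with the date, a person can change region any number of times without adding rows to the rota.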
I have 2 large files, an Excel spreadsheet and a csv file, which are messed up but still need to be uploaded into a table. I'm in the process of learning how to use SSIS. Assume the columns and rows look something like this..
1st Excel spreadsheet (file extension .xlsx)...
ID Name GroupName City Time Price Date
A1 South Group1 London 10/06/2018 $4.50 13.30
A2 North Group2 New York $60 10/07/2018 09:00 AM
Fig 1
2nd Excel spreadsheet (file extension .csv)...
ID Name GroupName City Date Time Price
A3 East Group3 Paris 09/09/2017 $5.00 03:00 AM
A4 West Group4 Berlin 01/05/2018 $12.50 18:00
Fig 2
If you look at ID A2 in Fig 1, you will see the Date given as 09:00, with AM in a separate column. How do you solve a problem like that? This is only an example; the Time data lands in a different column at random. Also note in Fig 2 for A4
I am familiar to a degree with the Script Task and Foreach Loop Container.
I searched the net and found this website....
It is sort of what I am looking for.
For now a table has been created with these column names
ID, Name, GroupName, City, Date, Time and Price.
So ideally when data is loaded into the table it should look like this...
ID Name GroupName City Date Time Price
A1 South Group1 London 10/06/2018 13.30 $4.50
A2 North Group2 New York 10/07/2018 09:00AM $60
A3 East Group3 Paris 09/09/2017 03:00AM $5.00
A4 West Group4 Berlin 01/05/2018 18:00 $12.50
I am not sure how to approach this.
Please note: I just want to know what SSIS Toolbox Components I need to use. Once I know, I will attempt to solve this problem. That's the reason for no code example.
Thanks in advance.
Update
Thanks Hadi. If nobody minds, I will keep this thread open and update it when SSIS is fully available in VS 2019 and I have had the chance to find a solution.
I don't think there is an easy solution for that, but I will try to give some suggestions:
1. Convert the Excel file into a csv file
2. In the Flat File Connection Manager, define only one column of type DT_STR and length = 4000
3. In the Data Flow Task, add a Script Component to split each line, validate each column value and assign it to the relevant output column
You can refer to the following answers to learn more, since they contain helpful information on how to read data from a flat file when the data is not structured very well (even if it is not the same case):
SSIS ragged file not recognized CRLF
How to load unstructured flat file with uneven space as delimeter? And also file contain two header
SSIS reading LF as terminator when its set as CRLF
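The validation step inside the Script Component would normally be written in C# or VB.NET; the Python sketch below only illustrates the logic of assigning a value to Date, Time or Price by its shape rather than by its column position. The regular expressions and the `classify_fields` helper are assumptions based on the sample rows, not part of SSIS.

```python
import re

DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")
PRICE_RE = re.compile(r"^\$\d+(\.\d{1,2})?$")
# Accepts 13.30, 18:00 and 03:00 AM; the AM/PM suffix is optional.
TIME_RE = re.compile(r"^(\d{1,2}[:.]\d{2})\s*([AaPp][Mm])?$")


def classify_fields(values):
    """Assign loosely ordered values to Date, Time and Price by their shape."""
    row = {"Date": None, "Time": None, "Price": None}
    for raw in values:
        value = raw.strip()
        time_match = TIME_RE.match(value)
        if DATE_RE.match(value):
            row["Date"] = value
        elif PRICE_RE.match(value):
            row["Price"] = value
        elif time_match:
            row["Time"] = time_match.group(1) + (time_match.group(2) or "").upper()
        elif value.upper() in ("AM", "PM") and row["Time"]:
            # A stray AM/PM cell is assumed to belong to the time before it.
            row["Time"] += value.upper()
    return row


a1 = classify_fields(["10/06/2018", "$4.50", "13.30"])
a2 = classify_fields(["$60", "10/07/2018", "09:00", "AM"])
a3 = classify_fields(["09/09/2017", "$5.00", "03:00 AM"])
```

This relies on prices always carrying a `$` sign; without it, a value such as 13.30 could be either a time or a price and would need another rule.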
I am working with Excel and need some input on how to search for multiple words in a column and return the position where the match was found. For example, the table whose words I want to check is:
Column A Column B
North Carolina
South Boston
West Coast
East Central
The table I want to check these words against is below:
Column C
North West Carolina
Western Coastal
Eastern Time for Central
Southern Boston
The final output should give me something like below:
Column A Column B Column D
North Carolina 1
South Boston 4
West Coast 2
East Central 3
Note that we are searching for the words in the 2nd table irrespective of the order in which they appear. For example, even though the first row in the 2nd table is North West Carolina, we get a match. The output gives the position of the row where both words were matched.
Can this be done in Excel somehow? This seems to me like a combination of MATCH() and SEARCH(), but I haven't been able to crack it.
I tried the formula below, but it's not working:
VLOOKUP(and($A1&"*",$B1&"*"),'Table2'!$D$2:$D$5,1,FALSE)
Thanks
Try this (assuming data starts in row 2)
=MATCH(1,ISNUMBER(SEARCH(A2,C$2:C$5))*ISNUMBER(SEARCH(B2,C$2:C$5)),0)
entered as an array formula using Ctrl+Shift+Enter
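What the array formula does can be spelled out in a short Python sketch: for each row of Column C, test whether both words occur anywhere in the text (case-insensitively, as SEARCH does) and return the position of the first row where both tests pass. `first_match_position` is a hypothetical name used only for this illustration.

```python
def first_match_position(word_a, word_b, phrases):
    """Return the 1-based position of the first phrase containing both words, or None."""
    for position, phrase in enumerate(phrases, start=1):
        text = phrase.lower()
        if word_a.lower() in text and word_b.lower() in text:
            return position
    return None


column_c = [
    "North West Carolina",
    "Western Coastal",
    "Eastern Time for Central",
    "Southern Boston",
]
pairs = [("North", "Carolina"), ("South", "Boston"), ("West", "Coast"), ("East", "Central")]
positions = [first_match_position(a, b, column_c) for a, b in pairs]
```

As with SEARCH, this is substring matching, so "West" also matches "Western"; that is what lets "West Coast" find "Western Coastal".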