Extracting data from a non-formatted string - string

I want to extract certain parts of each record and put them into a nice spreadsheet format. The important parts are the address, ward number, square feet, and price. I was going to try something really complicated in PHP (I'm a novice), but thought there might be an easier way.
The data looks like this:
243-467
1402 E. Mt. Pleasant Ave. 50th Ward approximately 1,416 sq. ft. more or less BRT# 502440300 Improvements: Residential Dwelling
JANET DENNIS C.P. October Term, 2007 No. 01082 $105,641.01 Morton R. Branzburg, Esq.
244-712A
5407 Chestnut St. - Premise A 60th Ward Apt 2-4 Unts 2 sty Masonry; Improvement Area 4,610 sq. ft. BRT# 603011200 Improvements: Residential Dwelling
ALEXANDER TALMADGE, JR. (WHO HAS 1/3 INTEREST), BERNADINE ABAD AND BERNARD BLAIR TALMADGE $32,153.00 Drew Salaman, Esq.

Where does the data come from? Can you modify its output? If so, try outputting CSV text (http://en.wikipedia.org/wiki/Comma-separated_values). Excel will import CSV files.
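If you can't change how the source emits the data, a short script can pull the fields out and write CSV. Here is a minimal sketch in Python (not PHP, purely for illustration). It assumes every record is exactly three lines (ID, property description, owner and price), that the address is everything before the "NNth Ward" phrase, and that the first dollar amount on the third line is the price. The names parse_records and to_csv are made up for this example.

```python
import csv
import io
import re

SAMPLE = """243-467
1402 E. Mt. Pleasant Ave. 50th Ward approximately 1,416 sq. ft. more or less BRT# 502440300 Improvements: Residential Dwelling
JANET DENNIS C.P. October Term, 2007 No. 01082 $105,641.01 Morton R. Branzburg, Esq.
244-712A
5407 Chestnut St. - Premise A 60th Ward Apt 2-4 Unts 2 sty Masonry; Improvement Area 4,610 sq. ft. BRT# 603011200 Improvements: Residential Dwelling
ALEXANDER TALMADGE, JR. (WHO HAS 1/3 INTEREST), BERNADINE ABAD AND BERNARD BLAIR TALMADGE $32,153.00 Drew Salaman, Esq.
"""

# Assumption: the address is the text before "<number>th Ward".
WARD = re.compile(r"^(?P<address>.*?)\s+(?P<ward>\d+)(?:st|nd|rd|th) Ward")
SQFT = re.compile(r"([\d,]+)\s*sq\. ft\.")
PRICE = re.compile(r"\$([\d,]+\.\d{2})")


def parse_records(text):
    """Split the text into 3-line records and extract the wanted fields."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    rows = []
    for i in range(0, len(lines) - 2, 3):
        record_id, description, owner_line = lines[i:i + 3]
        ward = WARD.search(description)
        sqft = SQFT.search(description)
        price = PRICE.search(owner_line)
        rows.append({
            "id": record_id,
            "address": ward.group("address") if ward else None,
            "ward": int(ward.group("ward")) if ward else None,
            "sq_ft": int(sqft.group(1).replace(",", "")) if sqft else None,
            "price": float(price.group(1).replace(",", "")) if price else None,
        })
    return rows


def to_csv(rows):
    """Return the rows as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "address", "ward", "sq_ft", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(parse_records(SAMPLE)))
```

Fields that a pattern fails to find are left empty in the CSV, so records that don't fit these assumptions are easy to spot and fix by hand.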

Related

Excel - highlight cells if a percentage of the text is the same

I have a single column spreadsheet with thousands of rows of business names.
There are many, many duplicates, which are easy enough to find and purge. However, there are even more partial matches. For example, there is a law firm that I have seen listed multiple ways:
Jones, Smith and Paul
Jones, Smith, and Paul
Jones, Smith & Paul
Jones, Smith & Paul, LLC
Jones, Smith, Johnson & Paul, LLC
There are so many variations in these business names. The idea I keep coming back to is creating a formula to highlight cells that contain X% of the same text/characters. That way, I could make a few passes, say one pass with 50% matching text, one with 75% matching text, etc.
I have been scouring Google, and while I come across many posts asking similar things, I haven't come across anything that will solve this for me.
For anyone who comes across this in the future, Microsoft made an Excel plugin called Fuzzy Lookup that accomplishes exactly this partial/percentage based matching.
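Outside Excel, the same "X% similar" idea can be sketched with Python's standard difflib module. This is only an illustration of percentage-based matching and is not a description of how Fuzzy Lookup works internally. The helper names similarity and similar_pairs and the 0.75 threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher

NAMES = [
    "Jones, Smith and Paul",
    "Jones, Smith, and Paul",
    "Jones, Smith & Paul",
    "Jones, Smith & Paul, LLC",
    "Jones, Smith, Johnson & Paul, LLC",
    "Acme Plumbing",  # hypothetical unrelated name, added for contrast
]


def similarity(a, b):
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similar_pairs(names, threshold):
    """Return (name, other_name, score) for every pair at or above threshold."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = similarity(first, second)
            if score >= threshold:
                pairs.append((first, second, round(score, 2)))
    return pairs


for first, second, score in similar_pairs(NAMES, threshold=0.75):
    print(f"{score:.2f}  {first!r} ~ {second!r}")
```

Comparing every pair is quadratic, so with thousands of rows it may be worth removing exact duplicates first and only then running the fuzzy pass at a few different thresholds.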

Extract dates in various formats from free text string in excel

I'm struggling with extracting dates from a free text field in Excel where the date could be in any number of formats due to human input.
Some examples of the entries are below but essentially they could be 30/6, 30/06, 30th June, 30/6/21, 30/06/21, 30/06/2021, 30-6, 30th, 30-06, tomorrow!
Alright, I probably can't do much about that last one, and I can pull the date from the others if I know the format, but I'm looking for something that would handle any and all of the permutations.
Example Data
Column A
Empire still haven't repaired the roof. Waiting on Empire to waterproof the roof then I can patch up the ceiling in 611. ETA 30/06/2021
ETA 11/6/21
floor boards need replaced
22/6/21 awaiting parts
have sanded the filler that was in the war and refilled ready for or sanding again
new bathroom floor to replace. need to source materials and have time to do job. full bathroom sub floor to replace. ETA 29/06
Engineer attended and found E122 error, reset and tested but fault still present. New EVA probe required. Part to be sent to site.
new toilet handle ordered. ETA 28th June
TV on order
refer to Colin Warner for further works and required parts
28/6 awaiting underlay
leak in room 315 fixed. room 215 will need a week to dry. ETA 30th June.
10 July
ceiling too wet to carry out any work. eta 5-07-2021
I know it's a big ask but if you know of a formula or VBA UDF that can handle all of those then I'd be eternally grateful.
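For comparison, here is a rough sketch of the pattern-matching approach in Python; the same two patterns could be ported to a VBA UDF using a regular-expression object. It assumes day-first (UK style) dates, takes the year from a default_year argument when none is written, and returns only the first date found. A bare "30th" with no month and words like "tomorrow" are deliberately left unhandled. The function name extract_date is made up for this example.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# 30/6, 30-06, 30/6/21, 30/06/2021, 5-07-2021 ...
NUMERIC = re.compile(r"\b(\d{1,2})[/-](\d{1,2})(?:[/-](\d{4}|\d{2}))?\b")

# 30th June, 10 July, 28th Jun ...
MONTH_NAMES = (r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|"
               r"july?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|"
               r"nov(?:ember)?|dec(?:ember)?")
TEXTUAL = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th)?\s+(" + MONTH_NAMES + r")\b", re.IGNORECASE)


def extract_date(text, default_year):
    """Return the first day-first date found in text, or None."""
    match = NUMERIC.search(text)
    if match:
        day, month = int(match.group(1)), int(match.group(2))
        year = int(match.group(3)) if match.group(3) else default_year
        if year < 100:  # assumption: two-digit years mean 20xx
            year += 2000
    else:
        match = TEXTUAL.search(text)
        if not match:
            return None
        day = int(match.group(1))
        month = MONTHS[match.group(2)[:3].lower()]
        year = default_year
    try:
        return date(year, month, day)
    except ValueError:  # e.g. 31/02 or a month above 12
        return None


for entry in ["ETA 11/6/21", "new toilet handle ordered. ETA 28th June",
              "TV on order", "eta 5-07-2021"]:
    print(entry, "->", extract_date(entry, default_year=2021))
```

The numeric pattern will also match things that merely look like dates (a range such as "2-4", for instance), so results are best reviewed before they are trusted.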

variable length substring of a string with specific beginning and ending

I have to extract a substring, which always varies in length, from the middle of a string (cell) in Excel.
The criteria are:
it is always starting with a specific set of symbols (in this example "Ingredients:")
it is always ending with a specific set of symbols (in this example "Table of Nutritional Information").
The length can be any from 1 word to about 500.
It could be an Excel formula or even VBA. But I am a complete beginner with VBA, so please give specific advice there.
My example cell content is like this:
We could tell you that our Beanz are hard to beat. That they're brimming with deliciously rich, tomatoey flavour. But you already know that. Because you know what Beanz Meanz...
Heinz baked beans don't just taste great, but are nutritious too; high in fibre, high in protein and low in fat, as well as contributing to 1 of your 5 a day. Packed full of quality ingredients... it has to be Heinz. Love our Heinz Beanz as much as we do? Discover the rest of our range, including organic and no added sugar varieties!
Heinz Beanz come in a variety of multipacks, perfect for when you need to feed the whole family!
1 of your 5 a day.
No artificial colours, flavours or preservatives.
Suitable for Vegetarians and Vegans.
Naturally high in protein and fibre.
Gluten free and low in fat.
Ingredients:
Beans (51%), Tomatoes (34%), GRAIN, Water, Sugar, Spirit Vinegar, Modified Corn Flour, Salt, Spice Extracts, Herb Extract.
Suitable for Vegetarians. Free From Artificial Flavours.
Empty unused contents into a suitable covered container. Keep refrigerated and use within 2 days.
Table of Nutritional Information
Per 100g Per 1/2 can %RI*
Energy 329kJ 682kJ -
78kcal 162kcal 8%
Fat 0.2g 0.4g 1%
- of which saturates <0.1g <0.1g <1%
Carbohydrate 12.5g 25.9g 10%
- of which sugars 4.7g 9.8g 11%
Fibre 3.7g 7.7g -
Protein 4.7g 9.7g 19%
Salt 0.6g 1.2g 21%
*RI per serving. Reference intake of an average adult (8400kJ/2000kcal)
The desired outcome would be:
Ingredients:
Beans (51%), Tomatoes (34%), GRAIN, Water, Sugar, Spirit Vinegar, Modified Corn Flour, Salt, Spice Extracts, Herb Extract.
Suitable for Vegetarians. Free From Artificial Flavours.
Empty unused contents into a suitable covered container. Keep refrigerated and use within 2 days.
Let's say your example cell is A1, then in another cell you can do:
=TRIM(MID(A1;SEARCH("Ingredients:";A1);SEARCH("Table of Nutritional Information";A1)-SEARCH("Ingredients:";A1)))
You will probably have to adapt it a little to get rid of the trailing line breaks.
This is how it works:
SEARCH("Ingredients:";A1) finds the position of the first occurrence of the text Ingredients: and returns it as a number. This is the starting point for extracting text with MID.
SEARCH("Table of Nutritional Information";A1) does the same, but for the text Table of Nutritional Information. This is the end point of the extracted text.
The second position minus the first gives the number of characters to extract, starting at the first position.
TRIM just deletes any extra spaces. Note that extra spaces are not the same as line breaks.
In this case, to get rid of the trailing line breaks, subtract an extra 5:
=TRIM(MID(A1;SEARCH("Ingredients:";A1);SEARCH("Table of Nutritional Information";A1)-5-SEARCH("Ingredients:";A1)))
This will return the exact output you want, but I don't know whether it will work with all your inputs.
Assuming the source data is in column A, put the criteria headers "Ingredients" and "Table of Nutritional Information" in B1 and C1.
Then, in B2, enter this formula and copy it down:
=MID(LEFT($A2,FIND(C$1,$A2)-1),FIND(B$1,$A2)+LEN(B$1)+1,599)
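The same start-marker/end-marker logic can be sketched outside Excel as well. The Python below is only an illustration of the technique; the function name between and its include_start flag are made up for this example. Note that str.find is case-sensitive, whereas Excel's SEARCH is not.

```python
def between(text, start_marker, end_marker, include_start=True):
    """Return the text from start_marker up to (not including) end_marker.

    Returns None if either marker is missing. Surrounding whitespace,
    including line breaks, is stripped from the result.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    # Look for the end marker only after the start marker.
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return None
    if not include_start:
        start += len(start_marker)
    return text[start:end].strip()


cell = (
    "Gluten free and low in fat.\n"
    "Ingredients:\n"
    "Beans (51%), Tomatoes (34%), Water, Sugar.\n"
    "Suitable for Vegetarians.\n"
    "Table of Nutritional Information\n"
    "Per 100g Per 1/2 can %RI*\n"
)

print(between(cell, "Ingredients:", "Table of Nutritional Information"))
```

Because the result is stripped, there is no need for a hand-tuned offset such as the extra -5 to drop trailing line breaks.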

get new list from pivot_table?

I'm using python 3 and pandas. I have this pivot table below.
>>> print(p.head())
question_id 1 2 3 4 ... 26 27 28 29
assessment_attempt_id
...
243908 21-24 Female 4th year undergraduate White ... Disagree Disagree Agree Agree
290934 25-29 Male Prefer Not to Answer Black or African American ... Neutral Neutral Neutral Neutral
312457 18-20 Female 1st year undergraduate White ... Strongly Agree Strongly Agree Strongly Agree Strongly Agree
312766 18-20 Female 2nd year undergraduate Hispanic or Latina/o ... Agree Agree Agree Agree
312786 21-24 Female 4th year undergraduate Black or African American ... Strongly Disagree Agree Agree Agree
It is produced from this command:
p= pandas.pivot_table(df, index=["assessment_attempt_id"], columns=["question_id"], values="text", aggfunc='first')
The table is basically exactly what I want. Now I just want columns 1,2,3,4 and the assessment_attempt_id column in a Datatable, so I can join that data by assessment_attempt_id with another existing data table.
Normally I would subset the data by doing something like this:
df1 = df[['a','b']]
but that produces an error:
KeyError: "['1' '2'] not in index"
It seems like this should be a simple, solved problem, but I cannot find the answer. I also tried a groupby variation, which produced the same output, and again I could not extract the columns I wanted. I assume I need to get at the multi-index, but I don't know how. Thank you.

How to create tabular output in python

Currently, I'm looking to scrape the signatures table from the EDGAR filings for specific companies. I have created a Python program that gets down into each document and finds the tables that I need to scrape. I'm having trouble figuring out how to output the data to a file in a 'pretty' way.
Here's a link for a bit of a visual (just scroll to the bottom of the document, there will be a page of signatures there):
Example Document
What I'm looking to do is format the table, the same way it is formatted on the website, with each cell taking up a specific amount of space, and filling in unused space with... well, spaces!
My current output:
|Signature, Date, Title|
|/s/ Stanley M. Kuriyama| Chairman of the Board, February 29th, 2016|
|Stanley M. Kuriyama|
|/s/ Christopher J. Benjamin, President, Chief Executive, February 29th, 2016|
|Christopher J. Benjamin, Officer and Director (and so on...)|
|-----------------------------------------------------------------------------|
What I'm looking to do (periods are spaces):
|Signature......................,Title......................,Date...............|
|/s/ Stanley M. Kuriyama,.......,Chairman of the Board,.....,February 29th, 2016|
|Stanley M. Kuriyama............................................................|
|/s/ Christopher J. Benjamin....,President, Chief Executive,February 29th,2016..|
|Christopher J. Benjamin,.......,Officer and Director (and so on...)............|
|-------------------------------------------------------------------------------|
Is there any way to print out the string plus (maxSize - stringSize) spaces per cell, so the data looks more tabular? I'm looking to do this with vanilla Python 3, with no additional downloads, because the people using this program may not be as tech savvy as I am.
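Padding each cell to its column's maximum width is what str.ljust does, so this can be done with the standard library alone. The following is a minimal sketch, assuming the scraper already yields each table row as a sequence of cell strings (rows may be shorter than the widest row). The function name format_table and the sample rows are illustrative only.

```python
def format_table(rows, separator=" | "):
    """Return rows as text with every column padded to a common width."""
    n_cols = max(len(row) for row in rows)
    # Pad short rows with empty cells so every row has the same length.
    padded = [list(row) + [""] * (n_cols - len(row)) for row in rows]
    # The width of a column is the length of its longest cell.
    widths = [max(len(row[col]) for row in padded) for col in range(n_cols)]
    lines = []
    for row in padded:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append("|" + separator.join(cells) + "|")
    return "\n".join(lines)


rows = [
    ("Signature", "Title", "Date"),
    ("/s/ Stanley M. Kuriyama", "Chairman of the Board", "February 29th, 2016"),
    ("Stanley M. Kuriyama",),
    ("/s/ Christopher J. Benjamin", "President, Chief Executive",
     "February 29th, 2016"),
    ("Christopher J. Benjamin", "Officer and Director"),
]

print(format_table(rows))
```

To write the result to a file instead of the screen, pass the returned string to a file object's write method. The columns only line up when the file is viewed in a monospaced font.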

Resources