Parsing url,hashtags out of twitter text - excel

I have already extracted all the tweets into a CSV file, and I want to separate the tweet text from the hashtags and URLs. So far I have separated the hashtags in Excel using
Data -> Text to Columns
First, I don't know how to separate the URLs using this method.
Second, is there a better way to do that? All the examples I found online separate both things at the time of scraping.
TEXT
Learned a new concept today : metamorphic testing. http:/t.co/0is1IUs3aW
variant identification in pooled DNA using R http:/t.co/4PQfUaU
Meta-All: a system for managing metabolic pathway information http:/t.co/2PfJXUxq2X
Here is what it should look like
TEXT | URL
Learned a new concept today : metamorphic testing. | http:/t.co/0is1IUs3aW
variant identification in pooled DNA using R | http:/t.co/4PQfUaU
Meta-All: a system for managing metabolic pathway information | http:/t.co/2PfJXUxq2X
Right now both the text and the URL are in one column; I want to put them in separate columns.

Extract the URL from A2: =MID(A2,FIND("http",A2),500)
The rest from A2: =MID(A2,1,FIND("http",A2)-1)
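For anyone doing the same split outside Excel, here is a minimal Python sketch of the same FIND/MID logic. The function name `split_tweet` and the "no URL" fallback are assumptions for illustration; note that the Excel formulas above return an error when a cell contains no "http", while this sketch returns an empty URL instead.

```python
def split_tweet(tweet: str, marker: str = "http") -> tuple[str, str]:
    """Split a tweet into (text, url) at the first occurrence of marker."""
    pos = tweet.find(marker)  # like FIND("http", A2), but -1 instead of an error
    if pos == -1:
        return tweet.strip(), ""  # assumed fallback: no URL in this tweet
    return tweet[:pos].strip(), tweet[pos:].strip()


rows = [
    "Learned a new concept today : metamorphic testing. http:/t.co/0is1IUs3aW",
    "variant identification in pooled DNA using R http:/t.co/4PQfUaU",
]
for text, url in map(split_tweet, rows):
    print(f"{text!r}\t{url!r}")
```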

I would use a simple set of formulas.
=FIND()
=LEFT()
=RIGHT()
Here are the formulas I used
Here are the results of those formulas
Basically, FIND() tells you where "http" starts in your string. LEFT() returns everything to the left of that position, for example =LEFT(A2,FIND("http",A2)-1). RIGHT() returns everything from that position to the end, for example =RIGHT(A2,LEN(A2)-FIND("http",A2)+1).

Related

How to retrieve number from string between "o=" and "&" characters in Excel?

I was surfing the net, looking for a solution on how to retrieve the number that I need from a string in Excel.
So I have this kinda string:
"somecharacterso=3242&morecharacters"
and I am trying to retrieve the number "3242" following https://www.ablebits.com/office-addins-blog/excel-regex-formulas/#functions
but I am stuck with this RegExp: o=\b(\d+)\b& which extracts the full substring, not only the number.
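The behaviour described here is the difference between the full match and the capture group. As a rough illustration in Python's `re` module (not the Excel add-in from the link, whose functions may expose groups differently), the same pattern returns both:

```python
import re

s = "somecharacterso=3242&morecharacters"
m = re.search(r"o=\b(\d+)\b&", s)

full_match = m.group(0)   # the whole matched substring, delimiters included
number = int(m.group(1))  # only the digits captured by the parentheses

print(full_match)
print(number)
```

So the pattern itself is fine; what matters is asking the regex function for group 1 rather than the whole match.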
So, now tested and improved:
EDIT: based on comment to use "o=" and "&":
=MID(A1,FIND("o=",A1,1)+2,(FIND("&",A1,1)-FIND("o=",A1,1)-2))*1
Does exactly as asked.
=mid(A1,find("=",A1,1)+1,find("&",A1,1)-(find("=",A1,1)+1))*1
returns the 3242 as a number.
There's good style in Excel formulae as well as in programming languages, particularly for making the logic easily understandable to other people who use the file later.
So if the starting data is in A2:
In B2 put =FIND("o=",A2)
In C2 put =FIND("&",A2)
In D2 put =MID(A2,B2+2,C2-B2-2) * 1
Exactly the same logic as Solar Mike's answer, but for most people easier to follow, check, and amend if necessary in the future.
Formulae using native Excel functions (especially shortish ones) are understood by far more people than regex.
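The helper-column steps above translate almost line for line into other languages. A minimal Python sketch follows; the function name `number_between` is hypothetical, and unlike the Excel version it looks for "&" only after "o=" and raises a clear error when a delimiter is missing.

```python
def number_between(s: str, start: str = "o=", end: str = "&") -> int:
    """Return the integer found between the start and end delimiters."""
    b = s.find(start)  # column B: position of "o=" (0-based here, 1-based in Excel)
    if b == -1:
        raise ValueError(f"{start!r} not found")
    value_from = b + len(start)  # the "+2" in the Excel formula
    c = s.find(end, value_from)  # column C: position of "&", searched after "o="
    if c == -1:
        raise ValueError(f"{end!r} not found after {start!r}")
    return int(s[value_from:c])  # column D: MID(...) * 1


print(number_between("somecharacterso=3242&morecharacters"))
```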

How to filter using multiple keywords in google sheets or excel

I'm using google sheets, and I've been trying to filter data based on if the B value contains any of multiple keywords. I'm trying to sort account data, and the names aren't consistent, so I can't just say =FILTER(C:C,(B:B="BK's Stuff")+(B:B="Book")). I need something that will take information out of a lot of text like a wild card. What works great for a single entry is:
=FILTER(C:C,SEARCH("BK",B:B))
But I can't figure out how to combine it so it will filter all values that contain EITHER "BK" or "Book."
Thanks in advance.
You can do it by replacing SEARCH with REGEXMATCH (combined with ARRAYFORMULA where needed).
REGEXMATCH allows you to search for multiple keywords separated by |
Sample:
=FILTER(C:C,REGEXMATCH(B:B,"BK|book")=TRUE)
Note:
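REGEXMATCH is case sensitive, so you need to list each variant separately, e.g.
REGEXMATCH(B:B,"BK|bk|Bk|bK"), or prefix the pattern with (?i) to ignore case: REGEXMATCH(B:B,"(?i)bk|book"). Do not leave a trailing | in the pattern: the empty alternative matches every row.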
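To see how the alternation behaves, here is a small Python sketch of the same filter. The account names and amounts are made-up sample data, and Python's `re` is only an approximation of the RE2 engine Google Sheets uses, although alternation and case-insensitive matching work the same way for patterns this simple.

```python
import re

# Hypothetical sample data standing in for columns B and C
names = ["BK's Stuff", "book store", "Bakery", "My BOOK"]
amounts = [10, 20, 30, 40]

# One pattern, several keywords separated by |, case-insensitive
pattern = re.compile(r"bk|book", re.IGNORECASE)
filtered = [amt for name, amt in zip(names, amounts) if pattern.search(name)]
print(filtered)

# Pitfall: a trailing | adds an empty alternative that matches anything
matches_everything = bool(re.search(r"BK|bk|", "Bakery"))
print(matches_everything)
```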
This is for Excel:
You can combine several SEARCH()s as follows:
=FILTER(C1:C20,ISNUMBER(SEARCH("Book",B1:B20,1))+ISNUMBER(SEARCH("BK",B1:B20,1)))
(should be similar for Google Sheets)

How can I find text between two headings from docx in python

I want to extract information from the resume, for this, I have to identify headings and take text data underneath that heading.
I think you need to be more specific about your issue and the approach you want to take. For now, for heading extraction, you can first build a corpus of all the headings after reading the document in with beautiful soup. Once that corpus exists, you can match it against the headings of the resume and get each section by defining its starting and ending points, and then match skills etc., or whatever else you want to do with it.
This is the simplest approach based on your current question. Be more specific so I can guide you towards a more precise approach.
Best,
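As a rough sketch of the "corpus of headings" idea, the function below groups paragraph texts under the most recent known heading. It works on a plain list of strings, so how the paragraphs are read from the .docx file is left out; the function name, the heading list, and the sample resume are all assumptions for illustration.

```python
def split_by_headings(paragraphs, known_headings):
    """Group paragraph texts under the most recent known heading."""
    known = {h.lower() for h in known_headings}
    sections = {}
    current = None
    for text in paragraphs:
        stripped = text.strip()
        if stripped.lower() in known:
            current = stripped  # a new section starts here
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)  # body text of the current section
    return sections


# Hypothetical resume, one string per paragraph
resume = ["Jane Doe", "Skills", "Python, SQL", "Excel", "Education", "BSc Biology"]
sections = split_by_headings(resume, ["Skills", "Education", "Experience"])
print(sections["Skills"])
```

Text that appears before the first recognized heading (the name line here) is dropped, which may or may not be what you want.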

excel vba Delete entire row if cell contains the GREP search

I have a single column of text in Excel that is to be used for translating into foreign languages. The text is automatically generated from an InDesign file. I would like to clean it up for the translator by removing rows that simply contain a number ("20", "34.5", etc.), or a measurement ("5mm", "3.5 µm", etc.). I've found many posts (see link below) on how to remove a row containing a specific string, but none that use search patterns such as those I typically use with GREP searches: "\d+" and "\d.\d µm"
How would I do this? I am on Mac iOS if that helps.
Note that I would need to delete the row if the cell only contains a number or a measurement, not if the number is contained within a phrase, sentence, or paragraph, etc.
https://stackoverflow.com/a/30569969
It may not be what you are looking for, but how about just sorting the column and removing the rows that start with numbers? It is a manual approach, but from what I understand this translation process only happens from time to time. Am I right?
I see two possible issues in your question:
How to work with regular expressions in Excel?
How to delete rows in a loop?
Let me start with the second question: when you want to create a for-loop in order to remove items from a list, you MUST start at the end and go back to the beginning; otherwise each deletion shifts the remaining items up and the loop skips the next one (it's a beginner's trick, but a lot of people trip over it).
About the first question: this is a very useful post about this subject, it's too large to even give a summary here.
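Both points can be shown together in a short Python sketch (not VBA, and not taken from the linked post). The pattern and the unit list (mm, µm) are assumptions based on the examples in the question; `fullmatch` ensures a row is removed only when the whole cell is a number or measurement, and the loop runs from the last row to the first.

```python
import re

# Whole cell must be a number, optionally followed by an assumed unit
ONLY_NUMBER_OR_MEASUREMENT = re.compile(r"\d+(?:\.\d+)?\s*(?:mm|µm)?")

rows = ["20", "34.5", "5mm", "3.5 µm", "Tighten 4 screws", "Cover plate"]

# Iterate backwards so deleting a row never shifts the rows still to be checked
for i in range(len(rows) - 1, -1, -1):
    if ONLY_NUMBER_OR_MEASUREMENT.fullmatch(rows[i].strip()):
        del rows[i]

print(rows)
```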

place excel or access data into category based on text search

very new to VBA/excel/access programming. I've been getting more and more into Tableau, but for what I need right now Tableau is fairly restrictive.
I have a database of retail locations...there are about 42,000 rows of data going back ten years or so. What I'm trying to do is create some code that can do a text search of the store title (say WalMart) and in a blank row assign it to a category like Dept Store, Restaurant, etc.
The problem lies in the store titles...there is no consistency. For example WalMart could be Wal-Mart, Wal Mart, Walmart Store # 2739, or any other iteration that you can think of.
In Tableau I've been using a command that goes like if Title contains "Wal" and Title contains "Mart" then retailtype="Discount Chain"
This has worked great, but I'm restricted to the number of lines I can include in a calc.
Any help/advice on building something similar in excel or access would be greatly appreciated.
Rich
Sure thing, I can help. Use the wildcard "*" with the Like comparison, and normalize case first with the VBA string functions UCase() and LCase(); for more complex patterns, regular expressions are also an option.
Dim storetitle As String
Dim match As Boolean
storetitle = Cells(1, 3).Value
' start a loop here
' Like is case-sensitive by default in Excel VBA, so compare a lowercased copy
match = LCase(storetitle) Like "*walmart*" ' match is True or False depending on Like
A pattern such as "*wal*mart*" also catches variants like "Wal-Mart" and "Wal Mart".
I leave you a reference and urge you to ask more questions if you need to.
https://msdn.microsoft.com/en-us/library/swf8kaxw.aspx
String functions toolkit :
https://msdn.microsoft.com/en-us/library/dd789093.aspx
Depending on how inconsistent your input data is, you may need to get creative and make heavy use of string functions.
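The Tableau rule from the question (title contains "Wal" and title contains "Mart") generalizes to a small rule table, which avoids the limit on lines in a single calculation. The sketch below is in Python rather than VBA; the rule list, the category names other than "Discount Chain", and the sample titles are assumptions for illustration.

```python
# Each rule: (keywords that must all appear in the title, category to assign)
RULES = [
    (("wal", "mart"), "Discount Chain"),
    (("mcdonald",), "Restaurant"),  # hypothetical extra rule
]


def categorize(title, rules=RULES, default="Uncategorized"):
    """Return the category of the first rule whose keywords all occur in title."""
    lowered = title.lower()  # case-insensitive comparison
    for keywords, category in rules:
        if all(k in lowered for k in keywords):
            return category
    return default


titles = ["Wal-Mart", "Wal Mart", "Walmart Store # 2739", "McDonald's #12", "Corner Shop"]
for t in titles:
    print(t, "->", categorize(t))
```

Loose keyword pairs like this can over-match (any title containing both "wal" and "mart" qualifies), so it is worth reviewing the rows each rule catches.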
