Data on segmentation of US search traffic by topic

I'm working on a research project trying to understand the patterns and breakdowns of search usage and volumes in the United States.
Ideally, I would love a breakdown of search volume across topics like:
navigational (i.e. just want to get to a domain link)
news (if possible: split amongst events, celebs, politics, ...)
sports (if possible: dig into splits of live scores, news about an athlete or a team, ... )
finance (e.g. stock names )
anything local (e.g. food, restaurants, places)
people (e.g. bios)
anything time related (what time in NYC, SF, ...)
anything numbers related (math/calculators)
Other topics: immigration, legal, health/medicine, science/technology, food/recipes, code/math, politics, weather, images/video, etc.
Not sure if there is a dataset or good report somewhere that would give me insight into all these?
There seem to be a lot of keyword planning tools, which is somewhat helpful, and I guess I could collect data on groups of keywords related to the topics above, but for things like celebrity bios it would be quite difficult to group together all the data because each possible well-known person is their own keyword…
Any help or direction would be appreciated! Thank you so much.

Related

How to find popular Google search terms for a particular demographic/location/interest group?

I'm starting an online business targeted at a particular demographic and set of interests, so I would like to produce content targeted at what this particular target market is actually searching for.
Google Ads allowed me to refine my target audience to the exact categories (demographics and interests) I needed, but it couldn't tell me what that category of people tend to search for, except for the tiny subset that happens to click on one of my ads, which is very rare given I am just starting with a small budget. I would like to know the most popular search terms for everyone in the categories I specified, not just those who happened to click on my ads.
I tried Google Trends, which told me the popularity of a particular search term for a given country, but that's too broad: I need to narrow it down to a particular city, age group, parental status and interests. Google Trends also helped me find popular related search terms given a particular search term, so I could use that to see if there are any common popular search terms related to my guesses, but I could miss terms related to terms I never thought of.
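For what it's worth, that related-terms lookup can at least be scripted with the unofficial pytrends library; a rough sketch (the seed keyword and geo code are placeholders, and there are still no demographic filters):

    from pytrends.request import TrendReq

    # connect to Google Trends (unofficial client; the API may change)
    pytrends = TrendReq(hl='en-US', tz=360)

    # 'running shoes' and 'US-NY' are placeholders for your own seed
    # term and region code
    pytrends.build_payload(['running shoes'], timeframe='today 3-m', geo='US-NY')

    # related queries: {keyword: {'top': DataFrame, 'rising': DataFrame}}
    related = pytrends.related_queries()
    print(related['running shoes']['top'].head())

    # popularity by metro area, about as close to city level as Trends gets
    print(pytrends.interest_by_region(resolution='DMA').head())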
I could try producing content across a range of topics which I think my target audience might be interested in and then analyse the results using Google Ads, but that could be a very expensive trial-and-error process, and I might miss more popular topics which I never thought of.
Of course I could try to ask my target market in person directly (by interrupting people in the street!) but that would be very expensive for me, because I would have to travel to and stay at the location where my online business is targeted, hoping to meet people with the exact same demographics and interests that I am looking for.
I'm sure there must be a way to figure this out using Google search analytics. Essentially, all I need is a list of the most popular recent Google search terms for a particular location, demographic and interest group. Could anyone help me understand how to get this list?
Here are a few considerations, even if you found an answer.
Take a look at the AdRoll platform. Here's a potentially helpful article from them about target audience and demographics.
There's also a recent article about AdWords demographic targeting, and an older-looking article connecting demographics to search queries (though the page's source code suggests it was updated this year).
Last but not least, you're probably eligible to talk with a Google Small Business Advisor.

Is there an API for past NOAA weather forecasts (forecast archive)?

I'm looking for a source for old weather forecasts: yesterday's, last month's, last year's, for major cities in the US.
Seems like it's easy to find future forecasts, and historical actual data, but not historical forecasts.
The product you're probably looking for is the National Digital Forecast Database, the gridded system the NWS uses to input most of its forecasts. There's no API that I know of, but there are archived data files in places like here. This NWS page on degrib also offers some potential hints on what you may need.
The NWS does still also issue some specific point forecasts for certain locations, specialized forecasts for events like fires, plus forecast discussions, warning text, etc. If those are the types of things you are looking for, it may be a bit more of a slog to dig through and piece together the product identifiers and archive resources you want. Iowa State offers a tool for accessing some of the past data, but only by office. You also may want to dig into some of the text products on their MTArchive site, particularly perhaps the Public files; the specific data is less organized, yet the simple layout may make it more straightforward to find what you need. This StormTrack thread may offer one final rabbit trail towards finding archives of NWS text products.
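If you do get hold of archived NDFD GRIB2 files from those sources, here is a loose sketch of reading one in Python with the third-party pygrib library (an alternative to degrib; the filename is a placeholder, not a real product name):

    import pygrib  # third-party GRIB2 reader

    # placeholder path for a file downloaded from one of the archives above
    grbs = pygrib.open('ndfd_archive.grb2')

    # each GRIB message is one forecast grid (element + valid time)
    for grb in grbs:
        print(grb.name, grb.validDate)

    # pull the values and coordinates of the first message
    grb = grbs.message(1)
    values, lats, lons = grb.data()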
As mentioned in comments, you may also find there are additional users with useful input on the Earth Science Stack Exchange Beta community.

How to determine if a piece of text mentions a product

I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research on various libraries to figure out how I should be doing this, but I'm not confident yet that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
Product names that are also English words or other things would get caught, like mustang the horse versus Mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.)
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any specific language at this point: Java preferably, but Python and Scala are acceptable.
The answer that you chose is not really answering your question.
The best approach you can take is using a Named Entity Recognizer (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). The database there might be missing some new brands like Lyft (Uber's rival), but without developing your own proprietary database, the Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that has every brand's name and simply extract those from tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
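A rough sketch of this route in Python with NLTK (which the question already mentions); the brand set here is a hypothetical stand-in for whatever dictionary you build or crawl:

    import nltk
    # one-time setup:
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    # hypothetical brand dictionary; in practice, crawled or curated
    brands = {'snickers', 'coke', 'pepsi', 'mustang'}

    def candidate_products(text):
        # keep '#word' together instead of splitting off the '#'
        tokens = nltk.regexp_tokenize(text, r'#?\w+')
        tagged = nltk.pos_tag(tokens)
        # keep proper nouns (NNP/NNPS) and hashtags, then check the dictionary
        candidates = [w.lstrip('#').lower() for w, tag in tagged
                      if tag in ('NNP', 'NNPS') or w.startswith('#')]
        return [w for w in candidates if w in brands]

    print(candidate_products('Big news: 12-oz. bottles of Coke and Pepsi '
                             'on sale starting Fri.'))
    # -> ['coke', 'pepsi'] (tagger permitting)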
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they are nevertheless independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings with their canonical (i.e. correct) form.
Phonetic similarity: A common reason for "misspellings" is opacity in the relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods. (A sketch combining these approaches follows this list.)
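A minimal sketch of the first two ideas in Python, using the third-party jellyfish library for the phonetic (Metaphone) key; the word list is an example, and a hand-made mapping table would simply be checked before both steps:

    import difflib
    import jellyfish  # third-party; provides Metaphone and other encoders

    canonical = ['innovation', 'snickers', 'mustang', 'coca-cola']

    # index canonical forms by their phonetic key
    phonetic_index = {}
    for name in canonical:
        phonetic_index.setdefault(jellyfish.metaphone(name), []).append(name)

    def normalize(word):
        # 1. phonetic lookup: 'innovashun' shares a key with 'innovation'
        hits = phonetic_index.get(jellyfish.metaphone(word))
        if hits:
            return hits[0]
        # 2. fall back to surface-form similarity (edit-distance based)
        close = difflib.get_close_matches(word, canonical, n=1, cutoff=0.8)
        return close[0] if close else None

    print(normalize('innovashun'))  # -> 'innovation'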
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
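A toy illustration of that idea: score each sense of an ambiguous form by its overlap with context words (hand-picked here; a real system would learn these distributions from a corpus):

    # hand-picked context words per sense, purely for illustration
    SENSE_CONTEXTS = {
        ('mustang', 'car'):   {'ford', 'bought', 'drive', 'engine', 'dealer'},
        ('mustang', 'horse'): {'wild', 'gallop', 'herd', 'ranch', 'saddle'},
    }

    def disambiguate(form, sentence):
        words = set(sentence.lower().split())
        # count how many context words of each sense appear in the sentence
        scores = {sense: len(context & words)
                  for (f, sense), context in SENSE_CONTEXTS.items() if f == form}
        return max(scores, key=scores.get)

    print(disambiguate('mustang', 'i just bought my dream car a mustang'))
    # -> 'car' (via the overlap with 'bought')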
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
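A minimal sketch of such an alias table, with ambiguous forms deferred to a disambiguation step like the one sketched above (all entries are examples):

    # unambiguous aliases map straight to a canonical form
    ALIASES = {'coke': 'coca-cola', 'wc': 'toilet', 'loo': 'toilet'}
    # ambiguous forms need context before they can be mapped
    AMBIGUOUS = {'throne room'}

    def canonicalize(term):
        term = term.lower()
        if term in AMBIGUOUS:
            return None  # defer to a disambiguation step
        return ALIASES.get(term, term)

    print(canonicalize('Coke'))  # -> 'coca-cola'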
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.

Accurate algorithm for normalizing taxonomy terms?

I'm developing a shopping comparison website, and the project is at a very advanced stage. We index 50 million products daily using merchant feeds from various affiliate networks. Most of the problems I had are already solved, including the majority of the performance bottlenecks.
What is my problem: first of all, we are using Apache Solr with Drupal, BUT this problem IS NOT specific to Drupal or Solr; if you do not have knowledge of them, it doesn't matter.
We receive product feeds from over 2000 different merchants, and those feeds are a mess. They have no specific pattern; each merchant sends the feeds the way they want. We already solved many problems regarding this, but one remains: normalizing the taxonomy terms for the faceted browsing functionality.
Suppose that I have a "Narrow by Brands" browsing facet on my website. Now suppose that 100 merchants offer products from Microsoft. Here comes the problem. Some merchants put "Microsoft" in the "Brands" column of the data feed, others "Microsoft, Inc.", others "Microsoft Corporation", others "Products from Microsoft", etc. There is no specific pattern between merchants and, worse, some individual merchants are so sloppy that they have different strings for the same brand IN THE SAME DATA FEED.
We do not want all those different brands appearing in the navigation. We have a manual solution to the problem where we map the imported brands to a "good" brands table by hand ("Microsoft Corporation" -> "Microsoft", "Products from Microsoft" -> "Microsoft", etc.). With something like 10,000 brands in the database, this is doable. The problem is when it comes to bigger things like "Authors". When we import books into the system, there are over 800,000 authors; we have the same problem, and it is not doable by hand mapping. The problem is the same: "Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M.", etc.
Does anybody know a good way to automatically solve this problem with an acceptable degree of accuracy (85%-95% accuracy)?
Thank you for the help!
Some ideas that come to my mind, although it's just a loose thought:
Convert names to initials (in your example: TMA). Treat '-' as a space, so e.g. Antoine de Saint-Exupéry would be ADSE. The problem here is how to treat ','; its common usage, though, is to put the surname before the forename, so just swapping positions should work (so "A, TM" would become "TM, A"; get rid of the comma: TMA).
Filter authors in the database by those initials.
For each initial, if you have a whole name (Tom, Apostol), check if it matches; otherwise (M.) consider it a match automatically.
If you want some tolerance, you can compare names with Levenshtein distance and tolerate some differences (here you have an Oracle implementation).
Names that match are treated as the same author. To find the whole name, for each initial (T, M, A) you look up your filtered authors (after step 2) and try to find one with a whole name (Mike) rather than just an initial (M.); if you can't find one, use the initial. Therefore, each of the examples you gave would be converted to the same value, the full name (Tom Mike Apostol). (A code sketch of these steps follows at the end.)
Things that are worth thinking about:
Include mappings for name synonyms (most likely a few hundred records at most, like Thomas <-> Tom).
For this approach, it is crucial to have valid initials (no M instead of N, etc.).
edit: I coded such a thing some time ago, when I had to identify a person by their signature, ignoring scanning problems; people sometimes sign with "Name S. Surname", or "N.S.", or just "Name Surname" (which is another thing you should maybe consider in the solution: allowing the algorithm to ignore the second name, although in your situation it would be rather rare to omit someone's second name, I guess).
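A loose sketch of the steps above in Python, with difflib's similarity ratio standing in for Levenshtein distance (both are edit-based measures); the names are your own examples:

    import difflib
    import re

    def name_parts(name):
        # handle 'Surname, Forenames' order and treat '-' and '.' as spaces
        if ',' in name:
            last, first = name.split(',', 1)
            name = first + ' ' + last
        return re.sub(r'[-.]', ' ', name).split()

    def initials(parts):
        return ''.join(p[0].upper() for p in parts)

    def same_author(a, b, tolerance=0.85):
        pa, pb = name_parts(a), name_parts(b)
        # step 1-2: initials must match exactly
        if initials(pa) != initials(pb):
            return False
        for x, y in zip(pa, pb):
            # step 3-4: compare only where both sides have a full name;
            # a lone initial (e.g. 'M') matches automatically
            if len(x) > 1 and len(y) > 1:
                ratio = difflib.SequenceMatcher(None, x.lower(), y.lower()).ratio()
                if ratio < tolerance:
                    return False
        return True

    print(same_author('Tom Mike Apostol', 'Apostol, Tom M.'))  # -> True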

What is the correct way to implement a massive hierarchical, geographical search for news?

The company I work for is in the business of sending press releases. We want to make it possible for interested parties to search for press releases based on a number of criteria, the most important being location. For example, someone might search for all news sent to New York City, Massachusetts, or ZIP code 89134, sent from a governmental institution, under the topic of "traffic". Or whatever.
The problem is, we've sent, literally, hundreds of thousands of press releases. Searching is slow and complex. For example, a press release sent to Queens, NY should show up in the search I mentioned above even though it wasn't specifically sent to New York City, because Queens is a subset of New York City. We may also want to add "and", "or", negation, and text search to the query to create complex searches. These searches also have to be fast enough to function as dynamic RSS feeds.
I really don't know anything about search theory, or how it's properly done. The way we are getting by right now is using a data mart to store the locations the releases were sent to in a single table. However, because of the subset issue mentioned above, the data mart is gigantic, with millions of rows. And we haven't even implemented cities yet; there are about 50,000 cities in the United States, which will increase the size of the data mart so much that I'm afraid it just won't work anymore.
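Roughly, the idea behind our current data mart is to expand locations at index time, something like this toy sketch (names and ids are made up):

    from collections import defaultdict

    # child -> parent; a hand-rolled stand-in for a real place hierarchy
    PARENT = {
        'Queens': 'New York City',
        'New York City': 'New York',
        'New York': 'US',
    }

    def ancestors(location):
        while location:
            yield location
            location = PARENT.get(location)

    index = defaultdict(set)  # location -> ids of releases sent there

    def index_release(release_id, location):
        # store the release under its location AND every ancestor, so a
        # search at any level is a single lookup, not a subtree walk
        for loc in ancestors(location):
            index[loc].add(release_id)

    index_release(101, 'Queens')
    print(index['New York City'])  # -> {101}

(I gather this is essentially what an inverted index in a search engine like Solr does, just stored much more compactly than our rows.)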
Anyway, I realize this is not a simple question and there won't be a "do this" answer. However, I'm hoping one of you can point me in the right direction where I can learn about how massive searches are done, because I really know nothing about it, and such a search engine is turning out to be incredibly difficult to make. Thanks! I know there must be a way, because if Google can search the entire internet, we must be able to search our own database :-)
Google can search the entire internet, and it can search your data too, via a Google Search Appliance!

Resources