I have the following use case for building a Data Lake (e.g. in Azure):
My organization deals with companies that go into bankruptcy. Once a company goes bankrupt, it needs to hand over all of their data to us, including structured data (e.g. CSVs) as well as semi-structured and unstructured data (e.g. PDFs, Word documents, images, JSON, .txt files etc.). Having a data lake would help here as the volumes of data can be large and unpredictable and Azure Data Lake seems like a relatively low-cost and scalable storage solution.
However, apart from storing all of that data we also need to give business users a tool that will enable them to search through all of that data. I can imagine two search types:
searching for specific files (using file names or part of file names as the search criteria)
searching through all text files (word documents, .txt and PDFs) and identifying those files that meet the search criteria (e.g. a specific phrase being searched for)
Are there any out of the box tools that can use Azure Data Lake as a data source that would enable users to perform such searches?
Unfortunately, there isn't a tool that can filter files directly in Data Lake for now.
Even Azure Storage Explorer only supports search by prefix.
Data Factory supports filtering files, but it is usually used to copy and transfer data. Reference: Data Factory supports wildcard file filters for Copy Activity
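As a workaround for the filename-search case, a small script can enumerate blob names and filter them client-side with wildcards. A minimal sketch, assuming the files sit in a container reachable with the azure-storage-blob SDK (the container name, connection string, and sample paths below are placeholders):

```python
from fnmatch import fnmatch

def match_names(names, pattern):
    """Case-insensitively match blob names against a wildcard pattern."""
    pattern = pattern.lower()
    return [n for n in names if fnmatch(n.lower(), pattern)]

# In practice the names would come from the storage SDK, e.g.:
#   from azure.storage.blob import ContainerClient
#   container = ContainerClient.from_connection_string(conn_str, "bankruptcy-data")
#   names = [b.name for b in container.list_blobs()]
names = ["acme/balance-sheet-2021.csv", "acme/filing.pdf", "globex/notes.txt"]
print(match_names(names, "*.pdf"))    # → ['acme/filing.pdf']
print(match_names(names, "acme/*"))   # → ['acme/balance-sheet-2021.csv', 'acme/filing.pdf']
```

This covers the first search type (file names); it does not look inside the files, so the full-text case still needs an indexing service.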
Update:
Azure Cognitive Search seems to be a good choice.
Cognitive Search supports importing from Data Lake as a data source, and it provides filters to help us search the files.
A filter provides criteria for selecting documents used in an Azure Cognitive Search query. Unfiltered search includes all documents in the index. A filter scopes a search query to a subset of documents.
We could reference from Filters in Azure Cognitive Search
Hope this helps.
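To illustrate, a filter is just an OData expression passed alongside the search text. A minimal sketch with the azure-search-documents Python SDK, assuming an index populated by the blob indexer (so it carries the standard metadata_storage_file_extension field; the endpoint, key, and index name are placeholders):

```python
def extension_filter(ext):
    """Build an OData filter selecting documents with the given file extension.

    Assumes the index exposes the blob indexer's metadata_storage_file_extension field.
    """
    ext = ext.replace("'", "''")  # OData escapes single quotes by doubling them
    return f"metadata_storage_file_extension eq '{ext}'"

# Hypothetical usage against a real index:
#   from azure.search.documents import SearchClient
#   from azure.core.credentials import AzureKeyCredential
#   client = SearchClient(endpoint, "bankruptcy-index", AzureKeyCredential(key))
#   results = client.search(search_text='"specific phrase"',
#                           filter=extension_filter(".pdf"))
print(extension_filter(".pdf"))  # → metadata_storage_file_extension eq '.pdf'
```

The same pattern covers both search types: search_text does the full-text match, and the filter narrows by file attributes.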
Cognitive Search with Azure Data Lake is definitely an option, and it is the one Microsoft recommends. There are several factors to consider:
Price. https://azure.microsoft.com/en-us/pricing/details/search/. It is not a cheap option.
The size of your source data and of the index you need.
Your familiarity with other open-source services. ELK (Elasticsearch, Logstash, Kibana) is a popular open-source stack for full-text search.
Related
The Copy activity does not support Azure Cognitive Search as a source. Sink is fine, but not source. This makes transferring indexed documents from one index to another tedious: an Until loop wrapping GET + POST batches against the Search API, with conditional variables to break out of the outer Until iteration.
Easier way?
As of now, Azure Cognitive Search is not supported as a source; it can be used only as a sink in the Azure Data Factory Copy activity. Refer to the official documentation for supported data stores.
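Outside Data Factory, the same GET-batches/POST-batches workaround can be scripted. A minimal sketch with the azure-search-documents SDK, where the endpoint, key, and index names are placeholders; only the batching helper is concrete:

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# Hypothetical index-to-index copy (names and credentials are assumptions):
#   from azure.search.documents import SearchClient
#   from azure.core.credentials import AzureKeyCredential
#   cred = AzureKeyCredential(admin_key)
#   source = SearchClient(endpoint, "index-a", cred)
#   target = SearchClient(endpoint, "index-b", cred)
#   docs = source.search(search_text="*")   # the result iterator pages automatically
#   for batch in batched(docs, 1000):       # the Search API caps uploads at 1000 docs
#       target.upload_documents(batch)
```

Note that pulling documents back out of an index only recovers retrievable fields, so this is a workaround rather than a true index copy.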
Is there any way to use KQL to query a large local file (10k+ rows) such as Excel, CSV etc. alongside data hosted in Kusto (Azure Data Explorer)?
Here is my scenario:
I extensively use KQL to explore data hosted in Kusto (Azure Data Explorer) clusters. Mostly these explorations are very dynamic and in one-off scenarios to investigate situations.
For some data, I just have Excel and CSV files that I want to join with Kusto data. I know I could do this with Pandas, but I'm specifically asking if there's any way to do it with KQL, preferably without setting up a cluster and ingesting the data into a Kusto table.
There are a few ways this can be done, but there is one requirement: the data needs to be accessible from the Kusto cluster. For your scenario, that means the files need to be in Azure Storage. The lightest approach is the externaldata operator, but you can also set up an external table.
Also, please note that you can get your own free cluster to do this processing; to create it, go to http://aka.ms/kustofree
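For the externaldata route, the shape is: declare the columns, point at the file in storage, then join as if it were a table. A minimal sketch, where the storage URL, SAS token, column names, and table name are all placeholders:

```kusto
// Read a CSV from Azure Storage inline, without ingesting it.
let LocalLookup = externaldata (CompanyId: string, Region: string)
    [ h'https://<account>.blob.core.windows.net/<container>/lookup.csv;<SAS-token>' ]
    with (format = 'csv', ignoreFirstRecord = true);  // skip the header row
MyKustoTable
| join kind=inner LocalLookup on CompanyId
```

Excel files would first need to be saved as CSV (or another supported format), since externaldata reads delimited/structured formats, not .xlsx.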
Moving from AWS Glue to Azure Purview, and I am confused about something.
Is it possible to query the Azure Purview data catalog/assets in the same way we can query the AWS Glue data catalog using AWS Athena?
Unfortunately, you cannot query data from Azure Purview.
The Purview search experience is powered by a managed search index. After a data source is registered with Purview, its metadata is indexed by the search service to allow easy discovery. The index provides search relevance capabilities and completes search requests by querying millions of metadata assets. Search helps you to discover, understand, and use the data to get the most value out of it.
The search experience in Purview is a three-stage process:
The search box shows a history containing recently used keywords and assets.
As you begin typing, the search suggests matching keywords and assets.
The search results page shows assets matching the keyword entered.
For more details, refer to Understand search features in Azure Purview.
I have a large amount of diagnostics data stored in an Azure Blob Storage. Is there any way I can get that data searchable from my Azure SQL database? I would like to join on some custom data fields in my blob stored data.
Blob storage doesn't have searchable metadata, per se: You may certainly search containers for given blob names, and you may even enumerate blobs to look at their metadata. But aside from the container/blob URI, there are no built-in search mechanisms.
If you want to search for metadata, you'll need to build your own data store (e.g. in a searchable database such as SQL Database, the one you mentioned). This would be completely up to your app to do (you'd need to extract the specific data you want to search and store it in your database engine of choice). You'd then need to link your database engine's contents back to blob storage (e.g. store a blob's URL alongside its metadata).
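That extract-store-link pattern can be sketched with any SQL engine; in-memory SQLite stands in for Azure SQL Database here, and the custom fields and URLs are made up for illustration:

```python
import sqlite3

# In-memory SQLite stands in for Azure SQL Database; the schema is illustrative.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE blob_meta (
    blob_url  TEXT PRIMARY KEY,  -- link back to the blob in storage
    device    TEXT,              -- custom field extracted from the diagnostics payload
    logged_at TEXT)""")

# Your app extracts these fields while (or after) writing each blob.
db.executemany(
    "INSERT INTO blob_meta VALUES (?, ?, ?)",
    [("https://acct.blob.core.windows.net/diag/a1.json", "sensor-7", "2023-01-05"),
     ("https://acct.blob.core.windows.net/diag/a2.json", "sensor-9", "2023-01-06")])

# Query (or join) on the custom fields, then fetch the matching blobs by URL.
rows = db.execute(
    "SELECT blob_url FROM blob_meta WHERE device = ?", ("sensor-7",)).fetchall()
print(rows)  # → [('https://acct.blob.core.windows.net/diag/a1.json',)]
```

With the metadata in SQL Database proper, the WHERE clause can join directly against your existing application tables, which is the original ask.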
If you're talking about full-text search, you'd need to employ an appropriate fts tool. Azure provides Azure Search as a 1st-party full-text-search service, or you may certainly use a 3rd-party tool or service. What you choose is completely up to you.
I am using Azure to host my project and chose blob storage for all my files (they are megabytes in size and the count is huge). I have a requirement to search within all my files in blob storage (full-text search). I tried integrating with Azure Search but had no luck, as the indexes could only be built over SQL sources. Is there a way to get full-text search over blobs?
If not, what would be an effective way of storing the documents in Azure while still making them searchable (full-text search), just like what SharePoint provides?
I work on Azure Search. We just shipped preview support for indexing documents stored in Azure blob storage, with support for PDF, Office docs, HTML and a few other formats. Please see https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/ for more details.
Thanks,
Eugene
You can try Azure Search, which now supports cognitive search (preview), where it does image recognition using OCR. It does a great job with PDFs and all types of documents.
It works well even with scanned documents.
There is an online demo from Microsoft on Azure Search that shows this off nicely: https://jfk-demo.azurewebsites.net/