Related files dataset with actual files - NLP

Hello, I'm making an algorithm to enhance HDFS performance: a machine learning model will run on the existing files to find files (articles, movies, ...) that are relevant to the user's target file based on some features, and pre-fetch them for the user.
I need a dataset that includes both a summary to analyze and the actual files to be stored on HDFS, like an articles dataset where I can group news articles together, fashion articles, and so on.
All the datasets I find on Kaggle have either only a .csv file that summarizes the dataset without the actual files, or the actual files without a .csv summary.

Related

Workflow for interpreting linked data in .ttl files with Python RDFLib

I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?
RDFLib is all about working with RDF data. If you have RDF data, my suggestion is to do as much RDF-native work as you can and only export to CSV if you want to do something like print tabular results or load into Pandas DataFrames. Of course, there is always more than one way to do things, so you could manipulate the data in CSV, but RDF, by design, carries far more information than a CSV file can, so when you're manipulating RDF data you have more things to get hold of.
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! You should store the .ttl files you can get, and then you may indeed retrieve all the other data referred to by URI. Presumably that data is also in RDF form, so you should download it into the same graph you loaded the initial .ttl files into; then you have the full graph, with links and literal values in it, at your disposal to manipulate with SPARQL queries.
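As a rough sketch of that workflow (the file name and the profession predicate below are placeholders, not the library's actual vocabulary):

import rdflib

g = rdflib.Graph()
g.parse("biographies.ttl", format="turtle")  # one of the library's .ttl files (hypothetical name)

# Dereference the object URIs and merge the returned RDF into the same graph.
# In practice you would restrict this to the predicates/namespaces you care
# about rather than following every URI.
for obj in set(g.objects()):
    if isinstance(obj, rdflib.URIRef):
        try:
            g.parse(str(obj))  # most linked-data servers return RDF at the URI
        except Exception:
            pass  # some URIs won't dereference to parseable RDF; skip them

# Query the combined graph with SPARQL, e.g. resolve profession labels
# (the predicate is a placeholder for whatever the library actually uses).
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?label WHERE {
    ?person <http://example.org/vocab/profession> ?profession .
    ?profession rdfs:label ?label .
}
"""
for person, label in g.query(query):
    print(person, label)

This keeps everything in one graph, so you only drop to CSV (or a DataFrame) at the very end, once the query gives you exactly the table you want.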

Is there an equivalent to SAS sas7bdat table files in Python?

Is there a Python equivalent to reading and writing tabular files such as SAS sas7bdat files?
My team is moving away from SAS and we'd like to replicate the SAS process in Python with our methodology as follows:
1) Pull data from various sources, e.g. Excel, CSV, DBs, etc.
2) Update our Data Warehouse with the new information and export this data as a Python table file (to be used next)
3) Rather than pulling data from our warehouse (super slow) we'd like to read in those Python table files and then do some data matching on a bigger set of data.
We're trying to avoid sas7bdat files (and SASPy) altogether, since we won't have SAS for much longer.
Any advice or insights are greatly appreciated!
Unlike SAS, Python doesn't have a native table file format. However, there are modules that implement binary protocols for serializing and de-serializing Python objects. Consider using the HDF5 format to save and read files (https://www.h5py.org/). Another possibility is Pickle (https://docs.python.org/3/library/pickle.html).
Parquet is also worth considering.
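If your tables end up in Pandas DataFrames anyway (a common setup, though not stated in the question), all three options are essentially one-liners; the file names here are just placeholders:

import pandas as pd

# Stand-in for a table pulled from Excel/CSV/DB sources
df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})

# Parquet: columnar, compressed, fast to reload (requires pyarrow or fastparquet)
df.to_parquet("warehouse_extract.parquet")
df_parquet = pd.read_parquet("warehouse_extract.parquet")

# HDF5 via pandas (uses PyTables; h5py is the lower-level alternative)
df.to_hdf("warehouse_extract.h5", key="extract", mode="w")
df_hdf = pd.read_hdf("warehouse_extract.h5", key="extract")

# Pickle: simplest, but Python-specific and tied to library versions
df.to_pickle("warehouse_extract.pkl")
df_pickle = pd.read_pickle("warehouse_extract.pkl")

Parquet is usually the safest choice for sharing across tools, while Pickle is best kept for short-lived intermediate files.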

Azure POS and weather data analysis strategy

I have a question about the approach for a solution in Azure: how do I decide which technologies to use and how do I find the best combination of them?
Let's suppose I have two datasets, which are growing daily:
I have a CSV file that arrives daily in my ADL store; it contains weather data for all possible latitude and longitude combinations and their zip codes, together with 50 different weather variables.
I have another dataset with POS (point of sale) data, which also arrives as a daily CSV file in my ADL storage. It contains sales data for all retail locations.
The desired output is to have the files "shredded" so that the data is prepared for Azure ML forecasting of sales based on weather, with the forecasting done per retail location and delivered via a Power BI dashboard to each of them. A requirement is that no location may see the forecasts of any other location.
My questions are:
How do I choose the right set of technologies?
How do I append the incoming daily data?
How do I create separate ML forecasting results for each location?
Any general guidance on the architecture topic is appreciated, and any more specific ideas comparing different suitable solutions are also appreciated.
This is way too broad a question.
I will only answer your ADL-specific question #2 and give you a hint on #3 that is not related to Azure ML (since I don't know what that format is):
If you just use files, add date/time information to your file path name (either in folder or filename). Then use U-SQL File sets to query the ranges you are interested in. If you use U-SQL Tables, use PARTITIONED BY. For more details look in the U-SQL Reference documentation.
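To make the naming idea concrete, here is a small plain-Python sketch of a date-partitioned path scheme (the folder layout is just an example, not an Azure API):

from datetime import date

def daily_path(dataset: str, day: date) -> str:
    # Encode the date in the folder structure so U-SQL file sets
    # (or PARTITIONED BY tables) can select a date range cheaply.
    return f"/{dataset}/{day:%Y/%m/%d}/{dataset}.csv"

print(daily_path("weather", date(2017, 6, 30)))  # /weather/2017/06/30/weather.csv
print(daily_path("pos", date(2017, 6, 30)))      # /pos/2017/06/30/pos.csv

With that layout, "appending" daily data is just landing each day's CSV in its own dated folder and letting the file set pattern pick up the range you query.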
If you need to create more than one file as output, you have two options:
a. If you know all the file names, write an OUTPUT statement for each file, selecting only the data relevant to it.
b. Otherwise, you have to dynamically generate a script and then execute it. Similar to this.

How do I create an RDD from input directory containing text files?

I am working with the 20 Newsgroups dataset. Basically, I have a folder with n text files; the files in a folder belong to the topic the folder is named after. I have 20 such folders. How do I load all this data into Spark and make an RDD out of it, so that I can apply machine learning transformations and actions on it (e.g. Naive Bayes)? I'm looking for ways to create the RDD, not help with how to apply the algorithms.
You can use SparkContext.wholeTextFiles(...). It reads a directory and creates an RDD of (path, content) pairs for all the files within that directory.
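A minimal PySpark sketch, assuming the dataset sits at a path like hdfs:///data/20_newsgroups/ (hypothetical); since each element is a (file_path, file_content) pair, the topic label can be recovered from the parent folder name:

from pyspark import SparkContext

sc = SparkContext(appName="newsgroups-rdd")

# One (path, content) pair per file, across all 20 topic folders
files = sc.wholeTextFiles("hdfs:///data/20_newsgroups/*")

# Derive the label from the enclosing folder name in the path
labeled = files.map(lambda kv: (kv[0].rsplit("/", 2)[-2], kv[1]))

print(labeled.take(1))

From there you can tokenize the contents and feed (label, features) pairs into MLlib.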

Generating different datasets from live dbpedia dump

I was playing around with the different datasets provided on the DBpedia download page and found that they are kind of outdated.
Then I downloaded the latest dump from the dbpedia live site. When I extracted the June 30th file, I just got one huge 37GB .nt file.
I want to get different datasets (like the different .nt files available at the download page) from the latest dump. Is there a script or process to do it?
Solution 1:
You can use the DBpedia live extraction framework: https://github.com/dbpedia/extraction-framework.
You need to configure the proper extractors (e.g. infobox properties extractor, abstract extractor, etc.). It will download the latest Wikipedia dumps and generate the DBpedia datasets.
You may need to make some code changes to get only the required data. One of my colleagues did this for German datasets. You still need a lot of disk space for this.
Solution 2 (I don't know whether it is really feasible or not):
Grep for the required properties in the dump. You need to know the exact URIs of the properties you want to get.
For example, to get all the home pages:
bzgrep 'http://xmlns.com/foaf/0.1/homepage' dbpedia_2013_03_04.nt.bz2 >homepages.nt
This will give you all the N-Triples with homepages. You can then load them into an RDF store.
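For instance, the resulting file can be sanity-checked with RDFLib before moving it into a full triple store (file name taken from the command above):

import rdflib

g = rdflib.Graph()
g.parse("homepages.nt", format="nt")  # the file produced by the bzgrep above
print(len(g), "homepage triples loaded")

Repeating the grep with other property URIs gives you one per-property file each, which is essentially how the per-dataset .nt files on the download page are organized.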
