How can I load my own dataset for person re-identification? - python-3.x

How can I load a dataset for person re-identification? In my dataset there are two folders, train and test.

I wish I could provide comments, but I cannot yet. Therefore, I will "answer" your question to the best of my ability.
First, you should provide a general format or example content of the dataset. This would help me provide a less nebulous answer.
Second, from the nature of your question I am assuming that you are fairly new to Python in general. Forgive me if I'm wrong in my assumption. With that assumption: depending on what kind of data you are trying to load (i.e. text, numbers, or a mixture of the two), there are various ways to load it, some easier than others. If you are strictly loading numbers, I suggest numpy.loadtxt(<file name>). If you are working with text, you could use the pandas package, or, if it's in a CSV file, Python's built-in csv package. Alternatively, if it's in a format that TensorFlow can read, you could use its provided data-loading functions.
Once you have loaded your data, you will need to separate it into input and output values. Since TensorFlow models accept either lists or NumPy arrays, you should be able to use these directly in your training and testing steps.
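To make the options above concrete, here is a minimal sketch. The file contents and column names are made up for illustration; in-memory strings stand in for files on disk:

```python
import io
import numpy as np
import pandas as pd

# Purely numeric data: numpy.loadtxt works well.
numeric_file = io.StringIO("1.0 2.0\n3.0 4.0")
arr = np.loadtxt(numeric_file)  # 2x2 float array

# Mixed text/number data in CSV form: pandas handles it cleanly.
csv_file = io.StringIO("name,score\nalice,0.9\nbob,0.7")
df = pd.read_csv(csv_file)

# Separate inputs (X) and outputs (y) before training:
X = df[["score"]].to_numpy()
y = df["name"].tolist()
```

For real files you would pass a path (e.g. np.loadtxt("data.txt") or pd.read_csv("data.csv")) instead of the StringIO objects.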

Check out the csv module (import csv), or load your dataset via open(filename, "r") or similar. It would be easiest if you provided more context/info.

Related

Moving datasets from brightway2 to openLCA

So, from the brightway2 documentation I see that I can export datasets as Excel files, SimaPro CSVs, ecospold 1 & 2, or JSON (with the bw2 custom structure, from what I understand).
openLCA allows imports from Excel, SimaPro CSV, and JSON-LD, among others.
So before I go through all combinations and potentially run into dead ends, I was wondering whether there is a preferred (if any?) way to do that?
thanks!
OpenLCA uses a slightly different mental model than Brightway, so I am not sure what the best way to do this is (Brightway makes a strong assumption that there is a graph with fixed edges, while OpenLCA allows more flexibility). There certainly isn't one recommended way.
Cauldron would be willing to fund an OpenLCA-compatible olca JSON schema exporter, but this doesn't help you right now...
You could also consider asking in the OpenLCA forums.

Workflow for interpreting linked data in .ttl files with Python RDFLib

I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?
RDFLib is all about working with RDF data. If you have RDF data, my suggestion is to do as much as you can RDF-natively, and only export to CSV when you want to do something like print tabular results or load into a pandas DataFrame. Of course there is always more than one way to do things, so you could manipulate the data as CSV, but RDF, by design, carries far more information than a CSV file can, so when you manipulate RDF data directly you have more to work with.
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! You should store the .ttl files you can get, and then you may indeed retrieve all the other data referred to by URI. Presumably that data is also in RDF form, so you should download it into the same graph you loaded the initial .ttl files into. You will then have the full graph, with both links and literal values, at your disposal to manipulate with SPARQL queries.

PDF Crawler with Deep Analytics Skills

I am trying to build a PDF crawler for corporate annual reports - these reports are PDF documents with a lot of text and also a lot of tables.
I don't have any trouble converting the PDFs into text, but my actual goal is to search for certain keywords (for example REVENUE, PROFIT) and extract figures such as "Revenue 1.000.000.000€" into a data frame.
I tried different libraries, especially tabula-py and PyPDF2, but I couldn't find a smart way to do that. Can anyone please help with a strategy? It would be amazing!
Best Regards,
Robin
Extracting data from PDFs is tricky business. Although there are PDF standards, not all PDFs are created equal. If you can already extract the data you need in text form, you can use regular expressions to pull out the values you require.
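A minimal sketch of the regex approach, using a made-up snippet of extracted PDF text (the keyword list and number format are assumptions you would adapt to your reports):

```python
import re

# Hypothetical text as extracted from a PDF.
text = "Summary: Revenue 1.000.000.000€ and Profit 250.000.000€ in 2020."

# Capture keyword/value pairs like "Revenue 1.000.000.000€".
pattern = re.compile(r"\b(Revenue|Profit)\s+([\d.]+€)", re.IGNORECASE)
figures = dict(pattern.findall(text))
```

The resulting dict maps each keyword to its figure and can be passed straight to pandas (e.g. pd.DataFrame([figures])) to build the data frame.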
Amazon has a machine learning tool called Textract, which you can use alongside their boto3 SDK in Python. However, it is a pay-per-use service. The main difference between Textract and regular expressions is that Textract can recognise and format key-value pairs and tables, which should mean that creating your 'crawler' is quicker and less prone to breaking if your PDFs change going forward.
There is also a Python package called textract, but it's not the same as the AWS service; rather, it's a wrapper that (for PDFs) uses pdftotext (by default) or pdfminer.six. It's worth checking out, as it may yield your data in a better format.

How to apply a dprep package to incoming data in score.py Azure Workbench

I had been wondering whether it is possible to apply "data preparation" (.dprep) files to incoming data in score.py, similar to how Pipeline objects may be applied. This would be very useful for model deployment. To find out, I asked this question on the MSDN forums and received a response confirming it is possible, but with little explanation of how to actually do it. The response was:
in your score.py file, you can invoke the dprep package from Python
SDK to apply the same transformation to the incoming scoring data.
make sure you bundle your .dprep file in the image you are building.
So my questions are:
What function do I apply to invoke this dprep package?
Is it: run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) ?
How do I bundle it into the image when creating a web-service from the CLI?
Is there a switch to -f for score files?
I have scanned through the entire documentation and Workbench Repo but cannot seem to find any examples.
Any suggestions would be much appreciated!
Thanks!
EDIT:
Scenario:
I import my data from a live database and let's say this data set has 10 columns.
I then feature engineer this (.dsource) data set using the Workbench resulting in a .dprep file which may have 13 columns.
This .dprep data set is then imported as a pandas DataFrame and used to train and test my model.
Now I have a model ready for deployment.
This model is deployed via Model Management to a Container Service and will be fed data from a live database which once again will be of the original format (10 columns).
Obviously this model has been trained on the transformed data (13 columns) and will not be able to make a prediction on the 10 column data set.
What function may I use in the 'score.py' file to apply the same transformation I created in workbench?
I believe I may have found what you need.
From this documentation you would import from the azureml.dataprep package.
There aren't any examples there, but searching on GitHub, I found this file which has the following to run data preparation.
from azureml.dataprep import package
df = package.run('Data analysis.dprep', dataflow_idx=0)
Hope that helps!
To me, it looks like this can be achieved by using the run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) method from the azureml.dataprep.package module.
From the documentation:
run_on_data(user_config, package_path, dataflow_idx=0, secrets=None, spark=None) runs the specified data flow based on an in-memory data source and returns the results as a dataframe. The user_config argument is a dictionary that maps the absolute path of a data source (.dsource file) to an in-memory data source represented as a list of lists.

PyTables vs Blaze vs SFrames

I am working on an exploratory data analysis in Python on a huge dataset (~20 million records, 10 columns). I will be segmenting and aggregating the data and creating some visualizations; I might also build some decision tree and linear regression models on that dataset.
Because of the large dataset I need a data frame that allows out-of-core data storage. Since I am relatively new to Python and to working with large datasets, I want to use an approach that would allow me to easily use sklearn on my data. I'm confused whether to use PyTables, Blaze, or SFrame for this exercise. If someone could help me understand their pros and cons, and the factors that are important in this kind of decision, that would be much appreciated.
Good question! One option you may consider is to not use any of the aforementioned libraries, but instead read and process your file chunk by chunk, something like this:
import pandas as pd

csv_path = r"path\to\file.csv"
pandas allows reading data from (large) files chunk-wise via a file iterator:
it = pd.read_csv(csv_path, iterator=True, chunksize=2_000_000)  # ~20M rows in 10 chunks
for i, chunk in enumerate(it):
    ...  # process each chunk here
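A self-contained version of the chunk-wise pattern, aggregating per chunk and then combining the partial results (the tiny in-memory CSV, column names, and chunk size are made up to stand in for the 20-million-row file):

```python
import io
import pandas as pd

# Hypothetical small CSV standing in for a huge file on disk.
csv_data = io.StringIO(
    "user,amount\n" + "\n".join(f"u{i % 3},{i}" for i in range(10))
)

# Aggregate within each chunk, then merge the partial sums:
# a common out-of-core pattern that never holds the full file in memory.
totals = {}
for chunk in pd.read_csv(csv_data, chunksize=4):
    for user, s in chunk.groupby("user")["amount"].sum().items():
        totals[user] = totals.get(user, 0) + s
```

Only per-chunk aggregates are kept, so memory use stays bounded by the chunk size rather than the file size; the same loop works unchanged with a real path and a chunksize in the millions.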

Resources