PyTables vs Blaze vs SFrames - python-3.x

I am working on an exploratory data analysis using Python on a huge dataset (~20 million records and 10 columns). I will be segmenting and aggregating the data and creating some visualizations; I might also build some decision tree and linear regression models using that dataset.
Because of the large dataset I need a data frame that allows out-of-core data storage. Since I am relatively new to Python and to working with large datasets, I want an approach that would let me easily use sklearn on my data. I'm confused whether to use PyTables, Blaze or SFrame for this exercise. It would be much appreciated if someone could help me understand their pros and cons, and which factors are important in this kind of decision.

Good question! One option you may consider is not to use any of the aforementioned libraries, but instead to read and process your file chunk by chunk with pandas, something like this:
csv="""\path\to\file.csv"""
pandas allows to read data from (large) files chunk-wise via a file-iterator:
it = pd.read_csv(csv, iterator=True, chunksize=20000000 / 10)
for i, chunk in enumerate(it):
...
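Since you mention sklearn: if you eventually want to fit models without loading the whole dataset into memory, here is a minimal sketch using an estimator that supports incremental fitting via partial_fit (the column names feature_1, feature_2 and target are placeholders, not taken from your data). Note that linear models like SGDRegressor support partial_fit, while standard decision trees do not.

import pandas as pd
from sklearn.linear_model import SGDRegressor

csv_path = r"\path\to\file.csv"  # same placeholder path as above
model = SGDRegressor()

for chunk in pd.read_csv(csv_path, iterator=True, chunksize=2_000_000):
    X = chunk[["feature_1", "feature_2"]]  # placeholder feature columns
    y = chunk["target"]                    # placeholder target column
    model.partial_fit(X, y)                # incrementally update the model per chunk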

Related

How to parse big XML in google cloud function efficiently?

I have to extract data from XML files that are several hundred MB in size within a Google Cloud Function, and I was wondering whether there are any best practices.
Since I am used to Node.js, I was looking at some popular libraries like fast-xml-parser, but it seems cumbersome if you only want specific data from a huge XML. I am also not sure whether there are performance issues when the XML is too big. Overall, this does not feel like the best way to parse and extract data from huge XML files.
Then I wondered whether I could use BigQuery for this task, where I simply convert the XML to JSON, load it into a dataset, and then use a query to retrieve the data I want.
Another option could be to use Python for the job, since it is good at parsing and extracting data from XML; even though I have no experience with Python, I was wondering whether this path could still be the best solution.
If anything above does not make sense, if one solution is preferable to the others, or if anyone can share any insights, I would highly appreciate it!
I suggest you check this article, in which they discuss how to load XML data into BigQuery using Python Dataflow. I think this approach may work in your situation.
Basically, what they suggest is:
Parse the XML into a Python dictionary using the xmltodict package (see the sketch after this list).
Specify a schema for the output table in BigQuery.
Use a Beam pipeline to read the XML file and populate a BigQuery table with it.
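As a rough illustration of the first step, here is a minimal sketch of parsing an XML document with xmltodict and flattening it into newline-delimited JSON, a format BigQuery can load directly (the file names and the element names root, item, id and value are made-up placeholders for whatever your real XML contains):

import json
import xmltodict

with open("data.xml", "rb") as f:
    doc = xmltodict.parse(f)  # returns a dict mirroring the XML structure

with open("data.ndjson", "w") as out:
    for item in doc["root"]["item"]:  # placeholder element names
        row = {"id": item["id"], "value": item["value"]}
        out.write(json.dumps(row) + "\n")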

Does anyone know a dataset to test the delta lake/apache iceberg?

I'm looking for an example dataset (or several) to test Delta Lake and Apache Iceberg, but I couldn't find any.
I want to test the MERGE functionality of both and compare them, but with a small example it is not possible to measure performance and determine which one is better.
I would like a dataset with primary keys to serve as the first version of the table, plus multiple datasets (small or large) containing the changes, so that I could test MERGE.
If anyone can help me, I appreciate it in advance.
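In the absence of a ready-made dataset, one workaround is to generate a synthetic one; here is a minimal sketch using pandas and numpy (all column names, sizes and file names are arbitrary assumptions) that produces a base table plus several change batches for exercising MERGE:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000_000  # arbitrary base table size

# Base table: primary key plus a couple of attribute columns.
base = pd.DataFrame({
    "id": np.arange(n),
    "amount": rng.random(n),
    "category": rng.integers(0, 10, n),
})
base.to_parquet("base.parquet")

# Change batches: a mix of updates to existing keys and brand-new keys.
for batch in range(5):
    updates = base.sample(50_000, random_state=batch).assign(amount=rng.random(50_000))
    new_rows = pd.DataFrame({
        "id": np.arange(n + batch * 10_000, n + (batch + 1) * 10_000),
        "amount": rng.random(10_000),
        "category": rng.integers(0, 10, 10_000),
    })
    pd.concat([updates, new_rows]).to_parquet(f"changes_{batch}.parquet")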

How can I load my own dataset for person re-identification?

How can I load a dataset for person re-identification? In my dataset there are two folders, train and test.
I wish I could provide comments, but I cannot yet. Therefore, I will "answer" your question to the best of my ability.
First, you should provide a general format or example content of the dataset. This would help me provide a less nebulous answer.
Second, from the nature of your question I am assuming that you are fairly new to Python in general. Forgive me if I'm wrong in my assumption. With that assumption, depending on what kind of data you are trying to load (i.e. text, numbers, or a mixture of text and numbers), there are various ways to load it, some easier than others. If you are strictly loading numbers, I suggest using numpy.loadtxt(<file name>). If you are working with text, you could use the Pandas package, or if it's in a CSV file you could use Python's built-in csv module. Alternatively, if it's in a format that TensorFlow can read, you could use its provided data-loading functions.
Once you have loaded your data you will need to separate the data into the input and output values. Considering that Tensorflow models accept either lists or numpy arrays, you should be able to use these in your training and testing steps.
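Since person re-identification datasets are usually folders of images, here is a minimal sketch using TensorFlow's directory-loading helper; it assumes (and this is only an assumption about your layout) that train and test each contain one subfolder per person identity, with that person's images inside:

import tensorflow as tf

# Labels are inferred from the per-identity subfolder names.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "train", image_size=(256, 128), batch_size=32)  # image_size chosen arbitrarily
test_ds = tf.keras.utils.image_dataset_from_directory(
    "test", image_size=(256, 128), batch_size=32)

# Each element is an (images, labels) batch usable in model.fit / model.evaluate.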
Check out the csv module (import csv), or load your dataset via open(filename, "r") or similar. It might be easiest if you provide more context/info.

Considerations for time-series

We are looking into using Azure Table Storage (ATS) together with Deedle (or other libraries with similar functionality) for our time-series storage, manipulations and calculations. From what I can read, F# also seems like a good choice for operations on arrays.
Our starting point is a set of time-series for energy consumption. The series will either be the consumption within an interval (fixed or irregular intervals) or a counter (from which we can calculate the consumption from one reading to the next). As a data point is just a tag (used as a partition key), timestamp (rowkey) and value, this should be well suited for ATS.
From a user's perspective, they want to do calculations on the series for a given period and resolution, e.g. calculate a third series as a difference between two others, for one given year with monthly resolution.
This raises a number of questions:
Will ATS together with F# be fast enough if we have 10,000 data points? 100,000? And compared to C#?
Resampling will require calculations of points between the series' timestamps. I haven't seen any Deedle examples for (linear) interpolation, but I assume that this is just passing a function which can look at the necessary data points? Will this be fast enough for our number of points?
The calculations will be determined by the users and we must have this as configuration. My best guess so far is to have the formula in some format we can parse easily into reverse Polish notation, and take special care of tags that represent series (i.e. read from ATS, resample, then do the operations).
Any comments will be highly appreciated!
I think Isaac already mentioned the most important points, but as this question involves some of the things I'm involved with, I thought I'd share a few additional remarks!
BigDeedle. As Isaac mentioned, I used Azure Table storage in BigDeedle. This is mainly useful if you want to explore data interactively using Deedle APIs and do some filtering and range restriction before getting the data in memory and running your calculations. BigDeedle loads data lazily from a potentially very big external data source. That said, if you eventually need to load all data into memory, this might not be all that useful for you.
The storage model used in BigDeedle might be useful though - it partitions data based on date, so when you want to get values in a given date range, it knows in which partitions to look. In my experience, loading data from ATS works pretty well, especially if you can do it on an MBrace cluster running in Azure (which is what my NDC demo does in the end).
Efficiency. I think the combination should work well for 10k or 100k data points - there will be no difference whether you do this from F# or C#. As for Deedle, I've definitely used it with data sets of this size - we optimize the library "as needed". Most of the functions are quite efficient already, but there may be some operations that are not; that is something that can be fixed if you open an issue on GitHub.
Resampling. There is a built-in function for linear interpolation (see here), but I suspect you may need to write your own custom interpolation. Deedle does not "hide the underlying data" from you, so this is not too hard - the last example on this page shows a custom function for filling missing data that uses linear interpolation. If you are doing something like this, you'll need to have the data in memory (so BigDeedle would not be very useful here).
Specifying calculations. I suspect this is a separate question, but F# is great for domain-specific languages. I did a talk on that at an earlier NDC. Generally, you can either specify your own DSL (and parse it) or have an embedded DSL where people write a subset of F#. F# has good support for both.
PS: If you wanted to get some more help with F#, Deedle and Azure tables, feel free to get in touch. I'm happy to share my experience - you should be able to find a contact via my profile.
F# versus C# will probably be basically the same performance-wise, unless you do something completely different between the two (for example, immutable vs mutable data sets). Both compile down to IL at the end of the day.
Azure Table Storage - make sure you pick your partition + row keys correctly. There is a lot of documentation on picking Azure Table Storage partition keys, especially for time series - make sure you group rows at the correct level to ensure data is distributed, with partitions neither too large nor too small. You might also want to look at the Azure Storage Type Provider and/or the Azure Storage F# libraries, which make working with ATS easier than the standard .NET SDK.
Deedle AFAIK does indeed have the ability to replace missing values across time series, and there's at least a project called BigDeedle which works directly over ATS (although I'm not sure how ready this project is).

Building a collaborative filtering recommendation engine using Spark mlLib

I am trying to build a recommendation engine based on collaborative filtering using Apache Spark. I have been able to run recommendation_example.py on my data, with quite good results (MSE ~ 0.9). Some of the specific questions I have are:
How do I make recommendations for users who have not done any activity on the site? Isn't there some API call for popular items, which would give me the most popular items based on user actions? One way to do it is to identify the popular items ourselves, catch the java.util.NoSuchElementException, and return those popular items.
How do I reload the model after some data has been added to the input file? I am trying to reload the model using another function which tries to save the model, but it gives an org.apache.hadoop.mapred.FileAlreadyExistsException error. One way to do it is to listen for the incoming data on a parallel thread, save it using model.save(sc, "target/tmp/<some target>") and then reload the model after significant data has been received. I am lost here on how to achieve that.
It would be very helpful, if I could get some direction here.
For the first part, you can count, for each item_id, the number of times that item_id appeared. You can use Spark's map and reduceByKey functions for that. After that, find the top 10/20 items with the highest counts. You can also weight items depending on their recency. A sketch of this is shown below.
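A minimal sketch of this popularity count in PySpark, assuming ratings is an RDD of (user_id, item_id, rating) tuples (the variable name and tuple layout are assumptions about your data):

# ratings: RDD of (user_id, item_id, rating) tuples -- placeholder
item_counts = (ratings
               .map(lambda r: (r[1], 1))           # key each record by item_id
               .reduceByKey(lambda a, b: a + b))   # count interactions per item

# Top 20 items by interaction count, usable as a fallback for brand-new users.
popular_items = item_counts.takeOrdered(20, key=lambda kv: -kv[1])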
For the second part, you can save the model under a new name every time. I generally create a folder name on the fly using the current date and time, and use the same name to reload the model from that folder. You will always have to retrain the model, using the past data plus the newly received data, and then use that model to predict; see the sketch below.
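A minimal sketch of the timestamped save/reload pattern with the MLlib ALS model (the path prefix and training parameters are placeholders):

from datetime import datetime
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel

# Retrain on the full (old + new) ratings RDD, then save under a unique path,
# which avoids the FileAlreadyExistsException from overwriting an existing folder.
model = ALS.train(ratings, rank=10, iterations=10)  # placeholder parameters
model_path = "target/tmp/model_" + datetime.now().strftime("%Y%m%d_%H%M%S")
model.save(sc, model_path)

# Later, reload the most recently saved model for predictions.
reloaded = MatrixFactorizationModel.load(sc, model_path)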
Independent of platforms like Spark, there are some very good link prediction techniques (for example, non-negative matrix factorization) which predict links between two sets; a small sketch of that idea follows below.
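As an illustration of the non-negative matrix factorization idea, here is a minimal sketch with scikit-learn on a tiny user-item matrix (the matrix values are invented for the example):

import numpy as np
from sklearn.decomposition import NMF

# Tiny user-item interaction matrix (rows = users, columns = items); values invented.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
])

nmf = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(R)   # user factors
H = nmf.components_        # item factors

# Reconstructed matrix: high scores in originally-zero cells suggest likely links.
R_hat = W @ H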
Other very effective recommendation techniques are Thompson sampling and multi-armed bandits (MAB). A lot depends on the raw dataset and how it is distributed. I would recommend applying the above methods to 5% of the raw dataset, building a hypothesis, using A/B testing, predicting links, and moving forward.
Again, all these techniques are platform-independent. I would also recommend starting from scratch instead of using platforms like Spark, which are only useful for large datasets; you can always move to these platforms later for scalability.
Hope it helps!
