Best way to process huge .csv


I need to process a pretty huge .csv (at least 10 million rows, hundreds of columns) with Python. I'd like:
To filter the content based on several criteria (mostly strings, maybe some regular expressions)
To consolidate the filtered data. For instance, grouping it by date and, for each date, counting occurrences based on a specific criterion. Pretty similar to what a pivot table could do.
I'd like to have user-friendly access to that consolidated data
I'd like to generate charts (mostly basic line charts)
Processing must be fast AND light, because computers at work cannot handle much and we're always in a hurry
Given these prerequisites, could you please suggest some ideas? I thought about using pandas. I also thought about dumping the CSV into a SQLite database (because it may be easier to query if I code a user interface). But this is really my first foray into this world, so I don't know where to start. I don't have much time, but I would be very glad if you could offer some advice, some good (and fresh) things to read, interesting libs and so forth. Sorry if Stack Overflow is not the best place to ask for this kind of help. I'll delete the post if needed. Regards.

Give xsv a shot. It is quite convenient and decently fast, and it fits the Unix philosophy. However, if the dataset will be used more than ten times, I'd suggest converting the CSV to some binary format; ClickHouse is a good choice for that.

There are 2 rather different situations:
when your reports (charts, pivot tables) use a limited number of columns from the original CSV, and you can pre-aggregate your large CSV file only once to get a much smaller dataset. This one-time processing can take some time (minutes), and there is no need to load the whole CSV into memory, as it can be processed as a data stream (row by row). After that you can use this small dataset for fast processing (filtering, grouping, etc.).
you don't know which columns of the original CSV may be used for grouping and filtering, so pre-aggregation is not possible. In other words, all 10M rows must be processed in real time (very fast); this is the OLAP use case. This is possible if you load the CSV data into memory once and then iterate over the 10M rows quickly when needed; if that is not possible, the only option is to import it into a database. SQLite is a good lightweight DB, and you can easily import CSV with the sqlite3 command-line tool. Note that SQL queries over 10M rows might not be so fast, and you may need to add some indexes.
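As a rough illustration of the first situation (one streaming pass that pre-aggregates the file), here is a minimal sketch using only the standard library. The column names (date, message), the regular expression and the sample rows are assumptions made up for the example; with a real file you would hand csv.DictReader an open file instead of the in-memory sample.

```python
import csv
import io
import re
from collections import Counter


def count_matches_per_date(rows, date_col, text_col, pattern):
    """Stream over rows once and count, per date, the rows whose text matches."""
    regex = re.compile(pattern)
    counts = Counter()
    for row in rows:  # only one row is held in memory at a time
        if regex.search(row[text_col]):
            counts[row[date_col]] += 1
    return counts


# Tiny in-memory stand-in for the real file (made-up data).
# With a real file: csv.DictReader(open("huge.csv", newline=""))
sample = io.StringIO(
    "date,status,message\n"
    "2020-01-01,ok,login succeeded\n"
    "2020-01-01,err,login failed\n"
    "2020-01-02,err,disk failed\n"
    "2020-01-02,err,timeout\n"
)

counts = count_matches_per_date(csv.DictReader(sample), "date", "message", r"failed$")
print(sorted(counts.items()))
```

The resulting counter is small (one entry per date), so it can be kept in memory, written out as a much smaller CSV, or fed to a charting library.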
Another option might be a specialized OLAP database like Yandex ClickHouse: you can use it to query the CSV file directly with SQL (File table engine) or import the CSV into its column store. This database is lightning fast with GROUP BY queries (it can process 10M rows in <1s).
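For the SQLite route mentioned in this answer, the import can also be done from Python with the built-in sqlite3 module instead of the command-line tool. This is only a sketch: the table name, the two columns and the sample rows are assumptions, and an in-memory database stands in for a file on disk.

```python
import csv
import io
import sqlite3

# Made-up sample standing in for the real CSV file.
sample = io.StringIO("date,status\n2020-01-01,ok\n2020-01-01,err\n2020-01-02,err\n")

conn = sqlite3.connect(":memory:")  # pass a file path to keep the database on disk
conn.execute("CREATE TABLE events (date TEXT, status TEXT)")

reader = csv.reader(sample)
next(reader)  # skip the header row
# executemany consumes the reader lazily, so the CSV is never fully in memory
conn.executemany("INSERT INTO events VALUES (?, ?)", reader)

# Index the columns used for filtering and grouping
conn.execute("CREATE INDEX idx_events_status_date ON events (status, date)")
conn.commit()

per_day = conn.execute(
    "SELECT date, COUNT(*) FROM events WHERE status = ? GROUP BY date ORDER BY date",
    ("err",),
).fetchall()
print(per_day)
```

Which columns to index depends on the queries you actually run, so treat the index above as a placeholder.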

Related

How do I find the right data design and the right tools/database/query for the requirement below

I have a requirement that I am not able to figure out how to solve. I have datasets in the format below
id, atime, grade
123, time1, A
241, time2, B
123, time3, C
or if I put in list format:
[[123,time1,A],[124,timeb,C],[123,timec,C],[143,timed,D],[423,timee,P].......]
Now my use case is to perform comparisons, aggregations and queries over multiple rows, like
time difference between the last 2 rows where id=123
time difference between the last 2 rows where id=123 and grade=A
time difference between the first, 3rd, 5th and latest one
all data (or the last 10 records for a particular id) should be easily accessible.
I also need to do further computation. What format should I choose for the dataset
and what database/tools should I use?
I don't think a relational database is useful here. I am not able to solve it with Solr/Elastic; if you have any ideas, please give a brief outline. Or any other tool: Spark, Hadoop, Cassandra, any pointers?
I am trying things out, but any help is appreciated.

Choosing the right technology depends heavily on your SLA: How much latency can your queries tolerate? What are your query types? Does your data qualify as big data? Is the data updatable? Do we expect late events? Do we need historical data in the future, or can we use techniques like rollup? To make this concrete: you can probably solve your problems with window functions. For example, you can store your data in any of the tools you mentioned and query it with the Presto SQL engine to get the desired result. But not all of them are optimal. Furthermore, these kinds of problems usually cannot be solved with a single tool; a set of tools is needed to cover all requirements.
tl;dr: the text below does not arrive at a solution. It introduces a way to think about data modeling and choosing tools.
Let me try to model the problem in order to choose a single tool. I assume your data is not updatable, you need low-latency responses, we don't expect any late events, and we face a large-volume data stream that must be saved as raw data.
Based on the first and second requirements, it's crucial to have random access (it seems you want to query on a particular ID), so solutions like Parquet or ORC files are not a good choice.
Based on the last requirement, data must be partitioned by ID. The first, second, and last requirements all rely on ID as the identifier, and there seems to be nothing like a join or a global ordering on other fields such as time. So we can choose ID as the partitioner (physical or logical) and atime as the clustering part: for each ID, events are ordered by time.
The third requirement is a bit vague. Do you want the result over all data, or for each ID?
To compute the first three conditions, we need a tool that supports window functions.
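To make the data model above a bit more tangible, here is a small pure-Python sketch of "partition by ID, order by atime inside each partition", followed by the lag-style computation a window function would do (the gap between the last two rows of a partition). It is not tied to any of the tools discussed; the timestamps are made up, since the question only shows placeholders like time1.

```python
from collections import defaultdict
from datetime import datetime

# (id, atime, grade) rows with made-up ISO timestamps
events = [
    (123, "2021-01-01T10:00:00", "A"),
    (241, "2021-01-01T10:05:00", "B"),
    (123, "2021-01-01T10:30:00", "C"),
    (123, "2021-01-01T11:15:00", "A"),
]


def partition_by_id(rows):
    """Group rows by id and keep each group ordered by time."""
    parts = defaultdict(list)
    for id_, atime, grade in rows:
        parts[id_].append((datetime.fromisoformat(atime), grade))
    for part in parts.values():
        part.sort()
    return parts


def last_two_gap(part, grade=None):
    """Time difference between the last two rows, optionally for one grade only."""
    times = [t for t, g in part if grade is None or g == grade]
    if len(times) < 2:
        return None  # not enough rows to compute a difference
    return times[-1] - times[-2]


parts = partition_by_id(events)
print(last_two_gap(parts[123]))       # last two rows where id=123
print(last_two_gap(parts[123], "A"))  # last two rows where id=123 and grade=A
```

In a database with window functions, the same result would typically come from something like LAG(atime) OVER (PARTITION BY id ORDER BY atime); the sketch only shows what that computes.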
Based on these notes, it seems we should choose a tool with good support for random-access queries. Cassandra, Postgres, Druid, MongoDB, and ElasticSearch are the ones I can think of right now. Let's check them:
Cassandra: It has great response times for random-access queries, can handle a huge amount of data easily, and does not have a single point of failure. But sadly it does not support window functions. Also, you have to design your data model carefully, and it does not seem to be a good choice here (because of the future need for raw data). We can bypass some of these limitations by using Spark alongside Cassandra, but for now we prefer to avoid adding a new tool to our stack.
Postgres: It's great for random-access queries on indexed columns. It supports window functions. We can shard data (horizontal partitioning) across multiple servers (and by choosing ID as the shard key, we get data locality for computations). But there is a problem: ID is not unique, so we cannot choose ID as the primary key, and we face some problems with random access (we could choose the ID and atime columns, with atime as a timestamp column, as a compound primary key, but that does not save us).
Druid: It's a great OLAP tool. Thanks to the way Druid stores data (segment files), with the right data model you can run analytic queries on a huge volume of data in under a second. It does not support window functions, but with rollup and some other functions (like EARLIEST) we can answer our questions. But by using rollup we lose the raw data, which we need.
MongoDB: It supports random-access queries and sharding. We can also get some kind of window function from its computing framework, and we can define pipelines for doing aggregations. It supports capped collections, which we can use to store the last 10 events for each ID if the cardinality of the ID column is not high. It seems this tool can cover all of our requirements.
ElasticSearch: It's great at random access, maybe the greatest. With some kind of filter aggregations, we can get a type of window function. It can handle a large amount of data with sharding. But its query language is hard. I can imagine answering the first and second questions with ES, but for now I can't come up with the query off the top of my head. It takes time to find the right solution with it.
So it seems MongoDB and ElasticSearch can meet our requirements, but there are a lot of 'if's along the way. I think we can't find a straightforward solution with a single tool. Maybe we should choose multiple tools and use techniques like duplicating data to find an optimal solution.

Persisting only part of a data source

I'm using intake to access the catalog catalog.ocean.GFDL_CM2_6.GFDL_CM2_6_control_ocean_surface.
At the moment I only work with small patches of that data, but accessing that data every single time is still quite costly (it's on Google Cloud Storage). So I want to use the persist option of intake to store that data locally. However as far as I've understood from the docs, it looks like one can only persist the whole dataset. For that specific dataset that would amount to almost 400 dollars if I take a cost of 0.1$ per GB, since the total data is 3976GB.
Hence my questions:
Is there a way (especially for a zarr file, which in theory should make this quite easy) to persist only parts of the data (for instance only a subset of the variables)?
This is probably more complicated, but can I push things further, by persisting regions of data I'm interested in (in terms of coordinates values for instance)?

There is no direct Intake way to do what you are asking for. Intake was conceived as a way to get your data into a format that you can then manipulate as you normally do, i.e., deal with only the loading part, so that a persisted data-set is the same as the original.
However, it is not hard to accomplish manually: you should grab the xarray dataset, filter for the variables and region you need, and call to_zarr to save the new dataset. You can then point a simple catalogue entry like the old one at the new location.
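The manual route described above could look roughly like the sketch below. It assumes the catalog entry yields an xarray Dataset (for example via .to_dask()); the variable names, coordinate names and target path in the usage comment are placeholders, not taken from the actual dataset.

```python
def persist_subset(ds, variables, region, target):
    """Write selected variables and a coordinate region of `ds` to a local zarr store.

    `ds` is expected to behave like an xarray.Dataset.
    """
    subset = ds[variables]            # keep only the listed data variables
    subset = subset.sel(**region)     # label-based selection on coordinates
    subset.to_zarr(target, mode="w")  # write the reduced dataset locally
    return subset


# Hypothetical usage (all names below are placeholders):
# ds = catalog.ocean.GFDL_CM2_6.GFDL_CM2_6_control_ocean_surface.to_dask()
# persist_subset(
#     ds,
#     ["some_variable"],
#     {"some_lat": slice(30, 40), "some_lon": slice(-60, -50)},
#     "local_subset.zarr",
# )
```

Since the dataset is loaded lazily, only the chunks overlapping the selection should need to be fetched from cloud storage, but it is worth checking the actual transfer size on a small region first.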
You could have done this manipulation in a driver directly if this was a specific pattern that would repeat a lot. In fact, we have mooted the idea of whether/how to implement such processing steps in Intake, but there is no plan yet. In the end, we may take the work on pipelines in Holoviews to describe processing steps.

Using Cassandra to store immutable data?

We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per second (but the rate will increase). Each event is small, about 1 KB.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan) so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern and since Cassandra is a schema db I don't suppose it's possible when the events come in many different forms? Would Cassandra be a good fit for this? If so is there something one should be aware of?

I had the exact same requirements for a "project" (rather a tool) a year ago; I used Cassandra and didn't regret it. In general it fits very well. You can fit quite a lot of data in a Cassandra cluster, the performance is impressive (although you might need some tweaking), and the natural ordering is a nice thing to have.
Rather than listing the benefits of using it, I'll concentrate on possible pitfalls you might not consider before starting.
You have to think about your schema. The data is naturally ordered within one row by the clustering key; in your case that will be the timestamp. However, you cannot order data between different rows. They might come back ordered from a query, but that is not guaranteed in any way, so don't rely on it. I believe there was a way to write such a query before 2.1 (using ORDER BY, disabling paging and allowing filtering), but it performed badly and I don't think it is even possible now. So you should order data between rows on your querying side.
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time, and you put them in different rows. You have to fetch the rows for the different variable types, then re-sort on the querying side. Another way to do it is to put all variable types in one row, but then filtering for only a subset becomes an issue to solve.
Row length is limited to 2 billion elements, and although that seems a lot, it really is not unreachable with time series data. You don't want to get anywhere near those two billion; keep it in the hundreds of millions at most. If you pick some parameter on which to split the rows (some increasing index, or rounding by day/month/year), you will have to implement that in your query logic as well.
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries. There are specific rules in CQL for filtering with the WHERE clause.
All in all these things might seem important, but they are really not too much of a hassle when you get to know Cassandra a bit. I'm underlining them just to give you a heads up. If something is not logical at first just fall back to understanding why it is like that and the whole theory about data distribution and the ring topology.
Don't expect too much from the collections within the columns, their length is limited to ~65000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) )
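Two of the pitfalls above (splitting rows by a time bucket, and re-sorting across rows on the querying side) can be sketched in plain Python. Nothing here talks to Cassandra: the per-day bucket, the tuple layout and the sample values are assumptions for illustration, and the two lists stand in for results already fetched from two rows, each ordered by its clustering key.

```python
import heapq
from datetime import datetime


def partition_key(variable, ts):
    """Hypothetical partition key: one row per variable per day."""
    return (variable, ts.strftime("%Y-%m-%d"))


def replay(*rows):
    """Merge several per-row streams, each already sorted by timestamp."""
    return heapq.merge(*rows, key=lambda event: event[0])


# Made-up events: (timestamp, variable, value)
temperature = [
    (datetime(2015, 1, 1, 0, 0), "temperature", 20.5),
    (datetime(2015, 1, 1, 0, 2), "temperature", 20.7),
]
pressure = [
    (datetime(2015, 1, 1, 0, 1), "pressure", 1013.0),
    (datetime(2015, 1, 1, 0, 3), "pressure", 1012.5),
]

for ts, variable, value in replay(temperature, pressure):
    print(partition_key(variable, ts), ts.time(), value)
```

heapq.merge is lazy, so the rows can be iterators over paged results rather than full lists.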

Based on the requirements you expressed, Cassandra could be a good fit as it's a write-optimized data store. Time series are quite a common pattern, and you can define a clustering order, for example on the timestamp of the events, in order to retrieve all the events in time order. I found this article on Datastax Academy very useful when I wanted to learn about time series.
A variable data structure is not a problem: you can store the data in a BLOB, then parse it internally in your application (i.e. store it as JSON and read it into your model), or you could even store the data in a map, although collections in Cassandra have some caveats that it's good to be aware of. Here you can find docs about collections in Cassandra 2.0/2.1.
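The BLOB approach could look something like this on the application side. It is only a sketch of the serialization step; the envelope with "type" and "payload" fields and the example event are assumptions, and the bytes would go into whatever blob column your table defines.

```python
import json


def encode_event(event_type, payload):
    """Serialize an event of any shape into bytes suitable for a BLOB column."""
    return json.dumps({"type": event_type, "payload": payload}).encode("utf-8")


def decode_event(blob):
    """Parse the stored bytes back into (type, payload) on the application side."""
    doc = json.loads(blob.decode("utf-8"))
    return doc["type"], doc["payload"]


# Made-up event for illustration
blob = encode_event("user_registered", {"user": "alice", "plan": "free"})
print(decode_event(blob))
```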
Cassandra is quite different from a SQL database, and although CQL has some similarities there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to pursue efficiency - a great article from Datastax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.

What's faster for Stata: manipulating data in a flat database (i.e. Excel) or in a relational database?

I'm an entry-level optimization analyst at a company that publishes risk ratings data for various companies. We have tons of data (to the point where our history is currently solely limited by the number of rows possible in Excel).
We currently use many .do files in Stata to perform all manipulations and statistical analyses (the largest production we run takes 9 hours, with one insheet taking half a minute). I'm trying to convince the company to move away from using a flat database to using a relational database but have been having trouble finding information online about whether flat or relational is better in Stata. So--which is better, and why?

I would hypothesise that you answered your own questions by emphasising that limitations of Excel prevent you from capitalising on the full potential of your data. Excel is not a proper analytical tool or data warehousing solution and as such there is no point in using it in analytical projects involving anything more complex than doing some basic sums for a small business / household needs.
To answer your question:
Flat file databases are an archaic technology dating to the beginnings of computer science: they were never designed to meet modern analytical needs of working with Big Data, live data streams, etc.
Relational databases
help to avoid data duplication
help to avoid inconsistent records
are easier when changing the data format

storing massive ordered time series data in bigtable derivatives

I am trying to figure out exactly what these newfangled data stores such as bigtable, hbase and cassandra really are.
I work with massive amounts of stock market data, billions of rows of price/quote data that can add up to 100s of gigabytes every day (although these text files often compress by at least an order of magnitude). This data is basically a handful of numbers, two or three short strings and a timestamp (usually millisecond level). If I had to pick a unique identifier for each row, I would have to pick the whole row (since an exchange may generate multiple values for the same symbol in the same millisecond).
I suppose the simplest way to map this data to bigtable (I'm including its derivatives) is by symbol name and date (which may return a very large time series; more than a million data points isn't unheard of). From reading their descriptions, it looks like multiple keys can be used with these systems. I'm also assuming that decimal numbers are not good candidates for keys.
Some of these systems (Cassandra, for example) claim to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm?
What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?
What if I want to get two time series, subtract one from the other, and return the two time series and their result? Will I have to do this logic in my own program?
Reading relevant papers seems to show that these systems are not a very good fit for massive time series systems. However, if systems such as google maps are based on them, I think time series should work as well. For example, think of time as the x-axis, prices as y-axis and symbols as named locations--all of a sudden it looks like bigtable should be the ideal store for time series (if the whole earth can be stored, retrieved, zoomed and annotated, stock market data should be trivial).
Can some expert point me in the right direction or clear up any misunderstandings?
Thanks

I am not an expert yet, but I've been playing with Cassandra for a few days now, and I have some answers for you:
Don't worry about the amount of data; it's irrelevant with systems like Cassandra if you have $$$ for a large hardware cluster.
Some of these systems (Cassandra, for example) claim to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm?
Cassandra is very useful when you know how to work with keys. It can sift through keys very quickly. So to search for MSFT between 11:00 am and 1:30 pm, you'd have to key your rows like this:
MSFT-timestamp, GOOG-timestamp , ..etc
Then you can tell Cassandra to find all keys that start with MSFT-now and end with MSFT-now+1hour.
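The idea behind this key scheme can be shown without any database: if keys are kept in sorted order, a time range for one symbol is a contiguous slice. The sketch below uses a sorted Python list and bisect as a stand-in for the store; the timestamp format and sample keys are assumptions. Note that this only works if the timestamp part sorts lexicographically in chronological order (for example zero-padded ISO timestamps), and that the store must actually keep keys ordered.

```python
from bisect import bisect_left, bisect_right

# Keys as "SYMBOL-timestamp" (made-up sample); sorted order groups by symbol,
# then by time within each symbol.
keys = sorted([
    "GOOG-2010-03-01T11:30:00.000",
    "MSFT-2010-03-01T10:59:59.999",
    "MSFT-2010-03-01T11:00:00.000",
    "MSFT-2010-03-01T12:15:30.250",
    "MSFT-2010-03-01T13:30:00.000",
    "MSFT-2010-03-01T13:30:00.001",
])


def key_range(sorted_keys, start, end):
    """Return all keys k with start <= k <= end using two binary searches."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


hits = key_range(keys, "MSFT-2010-03-01T11:00:00.000", "MSFT-2010-03-01T13:30:00.000")
print(hits)
```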
What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?
I am not an expert, but so far I have realized that Cassandra doesn't search by values at all. So if you want to do the above, you will have to make another table dedicated just to this problem and design your schema to fit the case. But it won't be much different from what I described above. It's all about naming your keys and columns. Cassandra can find them very quickly!
What if I want to get two time series, subtract one from the other, and return the two time series and their result? Will I have to do this logic in my own program?
Correct, all logic is done inside your program. This is not MySQL. This is just a storage engine. (But I am sure the next versions will offer this sort of thing.)
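Doing that logic in your own program does not have to be much code. As a minimal sketch, assuming each series has already been fetched into a dict keyed by timestamp (the series names and numbers are made up), subtraction over the shared timestamps looks like this:

```python
def subtract_series(a, b):
    """Return {timestamp: a - b} for timestamps present in both series."""
    return {ts: a[ts] - b[ts] for ts in sorted(a.keys() & b.keys())}


# Made-up values keyed by time of day
series_a = {"11:00": 30.0, "11:01": 30.5, "11:02": 31.0}
series_b = {"11:00": 10.0, "11:02": 10.5, "11:03": 11.0}

spread = subtract_series(series_a, series_b)
print(spread)
```

How to treat timestamps that appear in only one series (drop, forward-fill, interpolate) is a separate decision; the sketch simply drops them.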
Please remember, that I am a novice at this, if I am wrong, feel free to correct me.

If you're dealing with a massive time series database, then the standards are:
KDB: http://www.kx.com/
OneTick: http://www.onetick.com
Vhayu: http://www.vhayu.com
These aren't cheap, but they can handle your data very efficiently.

Someone whom I respect recommended the Open Time Series Database. In particular, he said that the schema was the nicest he had ever seen.
http://opentsdb.net/

I'm standing in front of the same mountain. My main problem with Cassandra is that I cannot get a stream over the result set, for example in the form of an iterator.
I have already looked up and down the docs and the net, but found nothing.
I can't fetch all the keys and then get the rows, as billions of rows make this impossible.

The DataStax Java Driver allows for automatic paging so that will stream the results just like an iterator and it's all built in. This is in Cassandra 2.0.1 by the way - http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
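Conceptually, automatic paging turns "fetch page, remember where you stopped, fetch the next page" into a plain iterator. The sketch below shows that pattern in generic Python; it is not the DataStax API. The fetch_page callable and its (rows, next_state) return shape are hypothetical, and a list stands in for the database.

```python
def iterate_pages(fetch_page):
    """Yield rows one by one, fetching the next page only when needed.

    `fetch_page(state)` is a hypothetical callable returning (rows, next_state),
    where next_state is None on the last page.
    """
    state = None
    while True:
        rows, state = fetch_page(state)
        yield from rows
        if state is None:
            return


# Stand-in for a database: serves a list in pages of two rows.
DATA = list(range(5))


def fake_fetch(state, page_size=2):
    start = state or 0
    end = start + page_size
    return DATA[start:end], (end if end < len(DATA) else None)


print(list(iterate_pages(fake_fetch)))
```

Only one page is held in memory at a time, which is what makes iterating over billions of rows feasible.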

Just for the sake of completeness, for anyone reading this in 2018: there is now a special database just for time series data, called TimescaleDB.
http://www.timescale.com/
This blog post is worth reading; it explains why it's superior to solutions like Cassandra for that special case and why they decided to build it on top of the relational PostgreSQL database.
https://blog.timescale.com/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c
