I have a 40 GB data file containing over 10 billion lines.
Each line contains 2 columns separated by a comma; both columns are of type 'float', and the first column increases with each row.
I want to plot this data in a chart. I've tried Excel, but it said there were too many lines and it couldn't display them all.
Are there any other tools that can handle data this large and show a line graph of it?
Excel is not the right tool for files of that size. There is no perfect rule, but I usually move the data into a relational database once it exceeds 200,000 rows. In SQL you can write the logic to create a smaller dataset that aggregates the data down to a more manageable size.
If you cannot get the data down to an Excel-worthy size, you'll need to look into options like Tableau that can graph data straight from a relational database.
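If you do the reduction outside of a database, here is a minimal Python sketch of the same aggregate-then-plot idea, assuming a headerless two-column CSV as described; the path, column names, and bin width are placeholders:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stream the headerless two-column CSV in chunks, bin the first column and
# average the second within each bin, so the huge file reduces to a series
# small enough to plot. Path, column names, and bin width are placeholders.
BIN_WIDTH = 1000.0
partial = {}  # bin start -> (sum of y, count of y)

for chunk in pd.read_csv("data.csv", header=None, names=["x", "y"],
                         chunksize=5_000_000):
    bins = (chunk["x"] // BIN_WIDTH) * BIN_WIDTH
    grouped = chunk.groupby(bins)["y"].agg(["sum", "count"])
    for b, row in grouped.iterrows():
        s, c = partial.get(b, (0.0, 0))
        partial[b] = (s + row["sum"], c + row["count"])

series = pd.Series({b: s / c for b, (s, c) in partial.items()}).sort_index()
series.plot()  # line graph of per-bin averages
plt.xlabel("first column (bin start)")
plt.ylabel("second column (mean)")
plt.show()
```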
I have a hash-delimited file that looks somewhat like this:
value#value#value#value#value#value##value
value#value#value#value##value#####value#####value
value#value#value#value###value#value####value##value
As you can see, when separated by hashes, there are more columns in the 2nd and 3rd rows than there are in the first. I want to be able to ingest this into a database using an ADF Data Flow after some transformations. However, whenever I try to do any kind of mapping, I only ever see 7 columns (the number of columns in the first row).
Is there any way to get all of the values, i.e. as many columns as there are in the row with the most items? I do not mind the nulls.
Note: I do not have a header row for this.
Azure Data Factory will not directly import the schema from the row with the maximum number of columns, so it is important to make sure every row in your file has the same number of columns.
You can use Azure Functions to validate your file and update it so that all rows have an equal number of columns.
You could try keeping a local sample file whose first row has the maximum number of columns and importing the schema from that file; otherwise you will have to go with Azure Functions, converting the file and then triggering the pipeline.
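As an illustration of the kind of pre-processing such a function could do, here is a minimal Python sketch, assuming a hash-delimited file with no header; the file names are placeholders and this is not part of any ADF-provided API:

```python
# Hypothetical pre-processing step: pad every row to the width of the widest row.
def pad_hash_delimited(in_path: str, out_path: str, delimiter: str = "#") -> None:
    with open(in_path, "r", encoding="utf-8") as f:
        rows = [line.rstrip("\n").split(delimiter) for line in f]

    # The widest row decides how many columns every row should have.
    max_cols = max(len(row) for row in rows)

    with open(out_path, "w", encoding="utf-8") as f:
        for row in rows:
            padded = row + [""] * (max_cols - len(row))  # missing values stay empty
            f.write(delimiter.join(padded) + "\n")


if __name__ == "__main__":
    pad_hash_delimited("input.txt", "padded.txt")  # placeholder file names
```

Once every row has the same width, the Data Flow mapping should pick up all of the columns.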
I have a dataset with close to 2 billion rows in Parquet format, spanning 200 files, that occupies 17.4 GB on S3. Close to 45% of its rows are duplicates. I deduplicated the dataset using the 'distinct' function in Spark and wrote it to a different location on S3.
I expected the storage to be reduced by half. Instead, the deduplicated data took 34.4 GB (double the size of the data with duplicates).
I checked the metadata of these two datasets and found that the column encodings of the duplicated and deduplicated data differ.
Difference in column encodings
I want to understand how to get the expected behavior of reducing the storage size.
Having said that, I have a few further questions:
I also want to understand whether this anomaly affects performance in any way. In my process, I have to apply a lot of filters on these columns and use the distinct function while persisting the filtered data.
I have seen in a few Parquet blogs that each column has a single encoding. Here I see more than one encoding per column. Is this normal?
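For reference, a minimal PySpark sketch of the deduplication step described above; the bucket and paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup").getOrCreate()

# Read the original Parquet dataset, drop exact duplicate rows,
# and write the result to a different S3 location.
df = spark.read.parquet("s3a://my-bucket/original/")  # placeholder path
df.distinct().write.mode("overwrite").parquet("s3a://my-bucket/deduplicated/")
```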
I know a Parquet file stores column statistics at the column level inside each row group, to allow more efficient queries on top of the data.
Does it also store column statistics at the file level (to avoid reading entire files unnecessarily)? How about at the column-page level?
Parquet indeed stores min/max statistics for row groups, but those are not stored inside the row groups themselves; they are in the file footer instead. As a result, if none of the row groups match, it is not necessary to read any part of the file other than the footer. There is no need for separate min/max statistics for the whole file: the row-group-level stats already solve this problem, since row groups are generally large.
Page-level min/max statistics exist as well, but are called column indexes and are only implemented in the unreleased 1.11.0 release candidate. They are a little bit more complicated than row group level min/max statistics, since row boundaries are not aligned with page boundaries, which necessitates extra data structures for finding corresponding values in all requested columns. In any case, this feature allows pinpointing the page-level location of data and radically improves the performance of highly selective queries.
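As a small illustration (not part of the answer above), one way to inspect these footer-level row-group statistics from Python is with pyarrow; the file path is a placeholder:

```python
import pyarrow.parquet as pq

# Only the footer metadata is read here, not the data pages.
meta = pq.ParquetFile("data.parquet").metadata   # placeholder path

for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(0)           # first column's chunk metadata
    stats = col.statistics
    if stats is not None and stats.has_min_max:
        print(f"row group {rg}: min={stats.min} max={stats.max}")
```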
I am trying to create a custom parallel extractor, but I have no idea how to do it correctly. I have big files (more than 250 MB) where the data for each row is stored across 4 lines; each line of the file holds the data for one column. Is it possible to create a parallel extractor that works on such large files? I am afraid that the data for one row will end up in different extents after the file is split.
Example:
...
Data for first row
Data for first row
Data for first row
Data for first row
Data for second row
Data for second row
Data for second row
Data for second row
...
Sorry for my English.
I think you can process this data with U-SQL sequentially rather than in parallel. You have to write a custom applier that takes one or more rows and returns one or more rows, and then invoke it with CROSS APPLY. You can take help from this applier.
U-SQL Extractors by default are scaled out to work in parallel over smaller parts of the input files, called extents. These extents are about 250MB in size each.
Today, you have to upload your files as row-structured files to make sure that the rows are aligned with the extent boundaries (although we are going to provide support for rows spanning extent boundaries in the near future). Either way though, the extractor UDO model would not know whether your 4 rows are all inside the same extent or spread across extents.
So you have two options:
Mark the extractor as operating on the whole file by adding the following line before the extractor class:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
Now the extractor will see the full file, but you lose the scale-out of the file processing.
Extract one row per line and use a U-SQL statement (e.g. using window functions or a custom REDUCER) to merge the rows into a single row.
I have discovered that I can't use a static method to get an instance of my IExtractor implementation in the USING statement if I want AtomicFileProcessing set to true.
I have many questions on whether to store my data in SQL or Table Storage, and on the best way to store it for efficiency.
Use Case:
I have around 5 million objects that are currently stored in a MySQL database. Only the metadata (Lat, Long, ID, Timestamp) is kept in the database. The other 150 columns about each object, which are not used much, were moved into Table Storage.
In Table Storage, should each object be stored as a single row, with the 150 rarely used columns combined into one column, instead of as multiple rows?
For each of these 5 million objects in the database, there is certain information about them (temperature readings, trajectories, etc.). The trajectory data used to be stored in SQL (~300 rows per object) but was moved to Table Storage to be cost effective. Currently it is stored in Table Storage in a relational manner, where each row looks like (PK: ID, RK: ID-Depth-Date, X, Y, Z).
Currently it takes a long time to grab much of the trajectory data; Table Storage seems to be pretty slow in our case. I want to improve the performance of the gets. Should the data be stored so that each object has 1 row for its trajectory, with all of the XYZs stored in 1 column in JSON format (see the sketch after these questions)? Instead of 300 rows to fetch, it would only need to fetch 1 row.
Is Table Storage the best place to store all of this data? If I wanted to get an X, Y, Z at a certain measured depth, I would have to fetch the whole row and parse through the JSON. This is probably a trade-off.
Is it feasible to keep the trajectory data, readings, etc. in a SQL database where there can be 5,000,000 x 300 rows for the trajectory data? There is also some information about the objects that can reach 5,000,000 x 20,000 rows. This is probably too much for a SQL database and would have to go into Azure cloud storage. If so, would the JSON option be the best one? The trade-off is that if I want a portion of, say, 1,000 rows, I would have to fetch the whole table; however, isn't that faster than querying through 20,000 rows? I could probably split the data into sets of 1,000 rows and use SQL as metadata for finding out which sets I need from the cloud storage.
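For concreteness, here is a minimal sketch of the "one entity per object, trajectory as a JSON property" layout mentioned above, using the azure-data-tables Python SDK; the connection string, table name, and property names are placeholders, not anything prescribed by Azure. Note that a single string property is capped at 64 KB and an entity at 1 MB, so very long trajectories would still need to be split.

```python
import json
from azure.data.tables import TableClient

# Hypothetical layout: one entity per object, whole trajectory in one JSON property.
table = TableClient.from_connection_string(
    conn_str="<connection-string>",      # placeholder
    table_name="Trajectories")           # placeholder

trajectory = [{"Depth": d, "X": 1.0 * d, "Y": 2.0 * d, "Z": 3.0 * d}
              for d in range(300)]       # dummy data standing in for ~300 points

table.upsert_entity({
    "PartitionKey": "object-123",        # the object ID
    "RowKey": "trajectory",
    "Points": json.dumps(trajectory),    # all points serialized into one property
})

# Reading the trajectory back is now a single point lookup instead of ~300 rows.
entity = table.get_entity(partition_key="object-123", row_key="trajectory")
points = json.loads(entity["Points"])
```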
In short, I'm having trouble understanding how to group and format data into Azure storage tables so that grabbing data for my application is efficient and fast.
Here's an example of my data and how I am getting it: http://pastebin.com/CAyH4kHu
As an alternative to Table Storage, you can consider using Azure SQL DB Elastic Scale to spread the trajectory data (and associated object metadata) across multiple Azure SQL DBs. This lets you overcome the capacity (and compute) limits of a single database. You would be able to perform object-specific queries or inserts efficiently, and you would have options for running queries across multiple databases, assuming you are working with a .NET application tier. You can find out more at http://azure.microsoft.com/en-us/documentation/articles/sql-database-elastic-scale-get-started/