I have inherited a large Spotfire file with more than 100 columns. I want to reduce the file size by removing unused columns.
Aside from manually going through each visualization, is there a way to identify which columns are not being used?
Thanks!
Related
I am reading in a CSV in batches, and each batch has nulls in various places. I don't want to use TensorFlow Transform, as it requires loading the entire data into memory. Currently I cannot ignore the NaNs present in each column while computing means if I try to do it for the entire batch at once. I can loop through each column and find the mean per column that way, but that seems like an inelegant solution.
Can somebody help me find the right way to compute the mean per column of a CSV batch that has NaNs present in multiple columns? Also, [1, 2, np.nan] should produce 1.5, not 1.
I am currently doing this, given a rank-2 tensor a:

    tf.math.divide_no_nan(
        tf.reduce_sum(tf.where(tf.math.is_finite(a), a, 0.), axis=0),
        tf.reduce_sum(tf.cast(tf.math.is_finite(a), tf.float32), axis=0))
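For reference, here is a minimal self-contained version of that approach (assuming TensorFlow 2.x and NumPy; the sample batch below is made up to include the [1, 2, np.nan] case mentioned above):

    import numpy as np
    import tensorflow as tf

    # Hypothetical batch: first column is [1, 2, NaN], second has no NaNs.
    a = tf.constant([[1.0, 4.0],
                     [2.0, 5.0],
                     [np.nan, 6.0]])

    # Mask out non-finite entries, sum what remains, and divide by the
    # per-column count of finite values; divide_no_nan guards against
    # columns that are entirely NaN.
    finite = tf.math.is_finite(a)
    col_sums = tf.reduce_sum(tf.where(finite, a, 0.0), axis=0)
    col_counts = tf.reduce_sum(tf.cast(finite, tf.float32), axis=0)
    col_means = tf.math.divide_no_nan(col_sums, col_counts)

    print(col_means.numpy())  # expected: [1.5 5. ]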
Let me know if somebody has a better option.
I know a Parquet file stores column statistics on the column level inside each Row Group, to allow more efficient queries on top of the data.
Does it also store column statistics on the file level (to avoid reading entire files unnecessarily)? How about the column page level?
Parquet indeed stores min/max statistics for row groups, but those are not stored inside the row groups themselves; they live in the file footer instead. As a result, if none of the row groups match, it is not necessary to read any part of the file other than the footer. There is no need for separate min/max statistics at the file level for this; the row-group-level stats already solve the problem, since row groups are generally large.
Page-level min/max statistics exist as well, but are called column indexes and are only implemented in the unreleased 1.11.0 release candidate. They are a little more complicated than row-group-level min/max statistics, since row boundaries are not aligned with page boundaries, which necessitates extra data structures for finding the corresponding values in all requested columns. In any case, this feature allows pinpointing the page-level location of data and radically improves the performance of highly selective queries.
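As an illustration (not part of the original answer), the row-group statistics stored in the footer can be inspected with PyArrow; this is a minimal sketch, assuming a local file named example.parquet:

    import pyarrow.parquet as pq

    # Only the footer metadata is read here, not the row group data itself.
    meta = pq.ParquetFile("example.parquet").metadata

    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            chunk = meta.row_group(rg).column(col)
            stats = chunk.statistics
            if stats is not None and stats.has_min_max:
                print(rg, chunk.path_in_schema, stats.min, stats.max)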
I am trying to create a custom parallel extractor, but I have no idea how to do it correctly. I have big files (more than 250 MB) where the data for each row is stored in 4 lines; each file line stores the data for one column. Is it possible to create an extractor that works in parallel on large files? I am afraid that the data for one row will end up in different extents after the file is split.
Example:
...
Data for first row
Data for first row
Data for first row
Data for first row
Data for second row
Data for second row
Data for second row
Data for second row
...
Sorry for my English.
I think you can process this data using U-SQL sequentially, not in parallel. You have to write a custom applier that takes single/multiple rows and returns single/multiple rows, and then you can invoke it with CROSS APPLY. You can take help from this applier.
U-SQL Extractors by default are scaled out to work in parallel over smaller parts of the input files, called extents. These extents are about 250 MB in size each.
Today, you have to upload your files as row-structured files to make sure that the rows are aligned with the extent boundaries (although we are going to provide support for rows spanning extent boundaries in the near future). Either way, though, the extractor UDO model would not know whether your 4 rows are all inside the same extent or split across extents.
So you have two options:
Mark the extractor as operating on the whole file by adding the following line before the extractor class:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
Now the extractor will see the full file, but you lose the scale-out of the file processing.
You extract one row per line and use a U-SQL statement (e.g. using window functions or a custom REDUCER) to merge the rows into a single row.
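Not U-SQL, but the merging idea in the second option can be sketched in Python: assign every 4 consecutive lines to one logical row and emit them as a single delimited record (the file names data.txt and merged.csv are hypothetical):

    # Merge every 4 consecutive lines into one comma-separated output row.
    # Assumes the input's line count is a multiple of 4.
    with open("data.txt") as src, open("merged.csv", "w") as dst:
        buffer = []
        for line in src:
            buffer.append(line.rstrip("\n"))
            if len(buffer) == 4:
                dst.write(",".join(buffer) + "\n")
                buffer = []

In U-SQL itself, the equivalent grouping key would come from something like a row number divided by 4 before pivoting the lines back into columns.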
I have discovered that I can't use a static method to get an instance of the IExtractor implementation in the USING statement if I want AtomicFileProcessing set to true.
I have a 40 GB data file containing over 10 billion lines.
Each line contains 2 columns, separated by a comma; both columns are of type 'float', and the first column increases with each row.
I want to display this data, and I've tried Excel, but it said there are too many lines and could not display them all.
Is there any other tool that can handle data this large and show a line graph of it?
Excel is not the right tool for files of that size. There is no perfect rule, but I usually move the data into a relational database if it is over 200,000 rows. In SQL you can perform the logic to create a smaller dataset that aggregates the data into a more manageable size.
If you cannot achieve an Excel-worthy size, then you'll need to look into options like Tableau that can graph data in a relational database.
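As a complementary sketch of the same "shrink first, then plot" idea (not from the answer above), the two-column file could also be downsampled in chunks with pandas and plotted with matplotlib; the file name data.csv and the keep-every-100,000th-point step are assumptions:

    import pandas as pd
    import matplotlib.pyplot as plt

    step = 100_000   # keep every 100,000th point; tune to taste
    kept = []
    # Read in chunks so the 40 GB file never has to fit in memory at once.
    for chunk in pd.read_csv("data.csv", header=None, names=["x", "y"],
                             chunksize=10_000_000):
        kept.append(chunk.iloc[::step])

    small = pd.concat(kept, ignore_index=True)
    small.plot(x="x", y="y", legend=False)
    plt.show()

Since the first column is stated to be increasing, plain downsampling preserves the overall shape of the line; aggregating (e.g. averaging y over buckets of x) would be the closer analogue of the SQL approach.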
I have some data (say hits) for different cache sizes, say 1M, 2M, and 4M, in three columns. I want to check that Hits(1M) < Hits(2M) < Hits(4M). One way is to write two comparison operations, but I have many columns. Is there a way to check for something like an 'ascending' order relationship between the columns?
How many rows are you dealing with?
One solution (if you're not dealing with an overwhelmingly large number of rows) would be to take each row, add the column headers above it, and then sort that row together with its own copy of the headers (a two-row sort across all your columns). The resulting order of each header row then tells you whether that row's values are in the desired order.
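The question doesn't name a tool, but if the data can be loaded into Python, a per-row "is ascending" check across arbitrarily many columns is compact with NumPy; here is a minimal sketch with made-up column names and values:

    import numpy as np
    import pandas as pd

    # Hypothetical hit counts for three cache sizes.
    df = pd.DataFrame({"hits_1M": [10, 30, 5],
                       "hits_2M": [20, 25, 6],
                       "hits_4M": [40, 50, 7]})

    cols = ["hits_1M", "hits_2M", "hits_4M"]  # extend with more columns as needed
    # A row is ascending if every left-to-right difference is positive.
    ascending = np.all(np.diff(df[cols].to_numpy(), axis=1) > 0, axis=1)
    print(ascending)  # [ True False  True]

This is a different route than the header-sorting trick above, but it scales to many columns without writing any extra comparison code.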