Replacing null values with zeroes in multiple columns [Spotfire]

I have about 100 columns with some empty values that I would like to replace with zeroes. I know how to do this with a single column using Calculate and Replace, but I wanted to see if there was a way to do this with multiple columns at once.
Thanks!

You could script it, but it would probably take you as long to write the script as it would to do it manually with a transformation. A better idea would be to fix it in the data source itself before you import it, so Spotfire doesn't have to apply the transformation every time; if you are dealing with a large amount of data, that could hurt your performance.
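If you control the source and it happens to be a flat file such as a CSV, a minimal sketch of that pre-import fix with pandas (the file name is a placeholder, nothing Spotfire-specific):

import pandas as pd

# hypothetical file name; fill every empty/null cell in every column with 0
df = pd.read_csv("source_data.csv")
df = df.fillna(0)
df.to_csv("source_data_clean.csv", index=False)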

Related

Removing duplicates between multiple large CSV files

I am trying to find the best way to remove duplicates from large CSV files.
I receive CSV files of around 5-6 million rows every month.
I need to adjust these (I only need some of the columns, and I need to add some others).
The files also contain a lot of duplicate and incomplete rows.
I've come up with a solution in Python where I use a set, check each row to see whether it's already in the set, and change what needs changing.
Now, I get the second file, and it contains a lot of duplicates that are in the previous file.
I'm trying to find an efficient solution to remove duplicates within the file, and between the different files. In the end I want to have a list (table or csv file) that contains only the new entries for that month.
I would like to use Python, and I was thinking about using an SQLite database for storing the data, but I'm unsure which way would be most efficient.
I would use numpy.unique():
import numpy as np
# load both CSVs as text so mixed (non-numeric) columns don't break parsing
file1 = np.loadtxt("path/to/file1.csv", delimiter=",", dtype=str)
file2 = np.loadtxt("path/to/file2.csv", delimiter=",", dtype=str)
# this will stack both arrays on top of each other, creating one giant array
data = np.vstack((file1, file2))
data = np.unique(data, axis=0)
np.unique takes the entire array and returns only the unique elements. Make sure you set axis=0 so that it goes row by row and not cell by cell.
One caveat: This should work, but if there are several million rows, it may take a while. Still better than doing it by hand though! Good luck!
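One more note: np.unique over the stacked array gives you the unique rows of both files combined, not only the rows that are new this month. If that monthly delta is what you actually need, a plain set difference may be closer to what you described in the question; here is a minimal sketch (paths are placeholders, and it assumes rows are comparable as-is):

import csv

def rows(path):
    with open(path, newline="") as f:
        return {tuple(row) for row in csv.reader(f)}

# rows that appear in this month's file but not in last month's
new_rows = rows("path/to/file2.csv") - rows("path/to/file1.csv")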

Is there a way to replace cells of a particular value with multiple cells?

Let's say I need to replace any cell that has a value of "outgoing" with multiple cells such as (0), (1), (0), (0), (2) in Excel. Is there a way to actually make this happen? I am doing this for a research project: every item in my data needs to be coded on five different scales, and 30 or so items make up almost half of the data. It would be enormously helpful to be able to replace the high-frequency items with all five values at once.
I am not sure I completely understand the result you are looking for but here goes:
How about using the Find and Replace functionality to replace all instances of "outgoing" with "(0),(1),(0),(0),(2)", and then using the Text to Columns functionality to split the single column containing "(0),(1),(0),(0),(2)" into five separate columns, so that each value ends up in its own cell.
You would need to split based on a delimiter (probably ",") and you should do all your replacing before you start splitting. Obviously you should test on some sample data first - Find and Replace is not your friend if you are not certain about your data set.
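If you would rather script the same replace-then-split idea outside Excel, here is a rough pandas sketch under the same assumptions (the workbook, column, and scale names are all made up for illustration):

import pandas as pd

# hypothetical workbook and column names
df = pd.read_excel("responses.xlsx")
codes = {"outgoing": ["0", "1", "0", "0", "2"]}  # high-frequency item -> its five scale values

# expand each coded item into five new columns; anything not in the dictionary stays blank
scales = pd.DataFrame(df["item"].map(lambda v: codes.get(v, [""] * 5)).tolist(),
                      columns=["scale1", "scale2", "scale3", "scale4", "scale5"],
                      index=df.index)
pd.concat([df, scales], axis=1).to_excel("responses_coded.xlsx", index=False)

As with Find and Replace, test it on a small sample of your data first.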

How to optimize COUNTIFS with very large data

I would like to create a report that looks like the picture below.
My data has around 500,000 cells (and it will continue to grow).
Right now I'm using the COUNTIFS function in Excel, but it takes a very long time to calculate (I cannot turn off automatic calculation).
The main value is collected as a date, and the date range spans about 3 years, so I have to write a lot of formulas to cover the whole range of values.
(image: result)
The picture below shows the data source: the top one cannot be changed, while the bottom one is one I created myself (and can change). I use WEEKNUM to convert dates to week numbers.
(image: data)
Are there any better formulas or other ways to make this file faster? All suggestions are welcome!
I was thinking about using a pivot table, but I don't know how to build one from this kind of data source.
PS. VBA is the last option.
You can download example file here: https://www.mediafire.com/?t21s8ngn9mlme2d
I will post this answer with the disclaimer that it is entirely dependent on the size of the data set, and that turning automatic calculation off and back on is the best way, but your question rules that out, so keep reading.
Your question made me curious, so I gave it a try and timed it. I set up two columns of over 100,000 random numbers drawn from 1-1000 and then did a COUNTIFS to count where the two columns were equal. I made a macro that turns off automatic calculation, inserts the start time, calculates, and then inserts the finish time. I highlighted the time difference in yellow.
First I tried your way, with two criteria in COUNTIFS:
Then I tried combining (concatenating) the two columns to see if I could make it easier by having only one COUNTIF criterion and data set. It doesn't; see the result below:
Finally, realizing what was going on, I decided to make the criteria match only the FIRST character of the value being looked up, essentially reducing the number of characters to check per cell. This had a positive result. See below:
Therefore my suggestion is to limit the length of the values you are comparing in any way possible. You are mostly looking at dates, so you might have to get creative, but this seems to be the best way possible without going to manual calculation.
I have worked with Excel sheets of a similar size. Especially if you are using the data on a regular basis, I would heartily recommend switching to a proper database (SQL-based, Access, or whatever fits your purpose). It does wonders for the speed, and you also won't run into the size limits of Excel. :-)
You can import the data you have now fairly easily.
I am happy as a clam with my postgresql db.
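For what it's worth, the move from COUNTIFS to a database can be quite small. Here is a rough sketch using SQLite from Python; the file, table, and column names ("date", "status") are assumptions, and the week number is computed up front to mirror the WEEKNUM column you already added:

import sqlite3
import pandas as pd

# one-time import of the raw sheet into SQLite (names are placeholders)
df = pd.read_excel("datasource.xlsx")
df["week"] = pd.to_datetime(df["date"]).dt.isocalendar().week  # same idea as WEEKNUM
con = sqlite3.connect("report.db")
df.to_sql("raw", con, if_exists="replace", index=False)

# the COUNTIFS-style report becomes a single GROUP BY
report = pd.read_sql("SELECT week, status, COUNT(*) AS n FROM raw GROUP BY week, status", con)
print(report)

The database recalculates nothing until you rerun the query, so it sidesteps the automatic-calculation problem entirely.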

Stata tab over entire dataset

In Stata is there any way to tabulate over the entire data set as opposed to just over one variable/column? This would give you the tabulation over all the columns.
Related - is there a way to find particular values in Stata if one does not know which column they occur in? The output would be which column and row they are located in or at least which column.
Stata does not use row and column terminology except with reference to matrices and vectors. It uses the terminology of observations and variables.
You could stack or reshape the entire dataset into one variable if and only if all variables are numeric or all are string. If that assumption is incorrect, then you would need to convert numeric variables to string, at least temporarily, before you could do that. I guess wildly that you are only interested in blocks of variables that are all either numeric or string.
When you say "tabulate" you may mean the tabulate command. That command has limits on the number of rows and/or columns it can show, which might bite, but with a small amount of work list could be used for a simple table with many more values.
tabm from tab_chi on SSC may be what you seek.
For searching across several variables, you could automate a loop.
I'd say that if this is a felt need, it is quite probable that you have the wrong data structure for at least some of what you want to do and should reshape. But further details might explode that.

Fast repeated row counting in vast data - what format?

My Node.js app needs to index several gigabytes of timestamped CSV data, in such a way that it can quickly get the row count for any combination of values, either for each minute in a day (1440 queries) or for each hour in a couple of months (also 1440). Let's say in half a second.
The column values will not be read, only the row counts per interval for a given permutation. Reducing time to whole minutes is OK. There are rather few possible values per column, between 2 and 10, and some depend on other columns. It's fine to do preprocessing and store the counts in whatever format suitable for this single task - but what format would that be?
Storing actual values is probably a bad idea, with millions of rows and little variation.
It might be feasible to generate a short code for each combination and match with regex, but since these codes would have to be duplicated each minute, I'm not sure it's a good approach.
Or it could use an embedded database like SQLite, NeDB or TingoDB, but I am not entirely convinced, since they don't have native enum-like types and may or may not be built for this kind of counting. But maybe it would work just fine?
This must be a common problem with an idiomatic solution, but I haven't figured out what it might be called. Knowing what to call this and how to think about it would be very helpful!
Will answer with my own findings for now, but I'm still interested to know more theory about this problem.
NeDB was not a good solution here, as it saved my values as plain JSON under the hood, repeating the key names for every row and adding unique IDs. It wasted lots of space and would surely have been too slow, even if only because of disk I/O.
SQLite might be better at compressing and indexing data, but I have yet to try it. Will update with my results if I do.
Instead I went with the other approach I mentioned: assign a unique letter to each column value we come across and get a short string representing each permutation. Then for each minute, add these strings as keys if they occur, with the number of occurrences as values. We can later use our dictionary to create a regex that matches any set of combinations, and run it over this small index very quickly.
This was easy enough to implement, but it would of course have been trickier if I had had more possible column values than the roughly 70 I found.
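For anyone curious, here is a minimal sketch of that letter-code index, written in Python for brevity (the real app is Node.js, and every name below is made up; the idea ports directly):

import re
import string
from collections import defaultdict

codes = defaultdict(dict)                      # column index -> value -> single-letter code
letters = defaultdict(lambda: iter(string.ascii_letters))
index = defaultdict(lambda: defaultdict(int))  # minute -> permutation string -> count

def encode(col, value):
    # assign the next unused letter for this column the first time a value appears
    if value not in codes[col]:
        codes[col][value] = next(letters[col])
    return codes[col][value]

def add_row(minute, row):
    # row is a tuple of column values; the permutation becomes a short string key
    key = "".join(encode(i, v) for i, v in enumerate(row))
    index[minute][key] += 1

def count(minute, wanted):
    # wanted is one set of allowed values per column; build a regex of character classes
    pattern = re.compile("".join(
        "[" + "".join(codes[i][v] for v in allowed) + "]"
        for i, allowed in enumerate(wanted)
    ))
    return sum(n for key, n in index[minute].items() if pattern.fullmatch(key))

With only a handful of distinct permutation strings per minute, scanning 1440 of these small dictionaries with a precompiled regex should fit comfortably within the half-second budget mentioned in the question.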
