I have a list of numbers ranging from 100000 to 101000 and I need to find which ones are not in order. Is there any way to do this? I don't want to go through a list of 1000 numbers by hand.
PS. I am taking this data from SQL, so in this instance I cannot sort the data. I just need to know which ones are not in the correct order.
If your numbers start in A1 then:
=IF(SMALL(A:A,ROW())=A1,"")
in row 1 of a helper column and copied down should indicate those that are out of order: it compares each value against the ROW()-th smallest value in column A, so any row that displays FALSE instead of a blank is out of place.
If it is just the numbers from the SQL entry that you need, you can sort them directly when writing the query by using ORDER BY in your SQL statement.
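For example, something like this (the table and column names are just placeholders for whatever your actual schema uses):

SELECT MyNumber
FROM MyTable
ORDER BY MyNumber;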
I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, so given that uniqueness, each record can have multiple instances in this table, each representing a historical entry of a change (across one or more columns of that record). Now if I want to find cases where a specific column value changed, I can easily achieve that by doing something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 as "Before", t2.Col12 as "After"
from table1 t1 inner join table1 t2 on t1.Key1 = t2.Key1 and t1.Key2 = t2.Key2
and t1.Key3 = t2.Key3 where t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this: essentially a list of all columns that changed. I don't care about the actual values that changed, just a list of column names that changed across all records. It doesn't even have to be per row. But the 3 keys will always be excluded, since they uniquely define a record.
Essentially I'm trying to find any columns that are susceptible to change. So that I can focus on them dedicatedly for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify this type of use case: https://docs.databricks.com/delta/delta-change-data-feed.html
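As a rough sketch of what that looks like in Databricks SQL (this assumes the question's table1 / Key1-Key3 / Col12 names, that the feed is enabled before the changes you care about happen, that changes arrive as in-place updates to the table, and version 0 as an arbitrary starting point):

ALTER TABLE table1 SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Pair each update's before and after images; comparing the paired rows column by
-- column (repeating or generating this test for every non-key column) gives the
-- list of columns that actually changed.
SELECT pre.Key1, pre.Key2, pre.Key3,
       pre.Col12  AS Col12_before,
       post.Col12 AS Col12_after
FROM table_changes('table1', 0) pre
JOIN table_changes('table1', 0) post
  ON  pre.Key1 = post.Key1
  AND pre.Key2 = post.Key2
  AND pre.Key3 = post.Key3
  AND pre._commit_version = post._commit_version
WHERE pre._change_type  = 'update_preimage'
  AND post._change_type = 'update_postimage'
  AND pre.Col12 <> post.Col12;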
Say I have a table of 20K records and want to sort by columns A and C. There are 20 records with identical A and C entries; how does Excel decide how to order these 20 records? Does it order by the value in the next available column not already defined in the sort? Some magic sum of non-null fields in that row? I am at a loss and can't find the answer anywhere. Any clarification would really help, as I don't think defining a sort by every available column is an efficient way to solve this problem.
Excel's sort is stable: it preserves the existing order of duplicate records, so rows with identical values in all of the sort columns keep their original relative order and no other column is consulted.
I have a table in Excel that I want to filter. It will have a maximum of 1 million rows and 80 columns. All the calculations etc. are done programmatically in arrays to cut down processing time. However, I want to also filter the results to display only certain results based on one column value, followed by a top 5% based on another filter value.
When I first did the sheet, it was limited to 65000 results, so there were no problems with the size of the data set. I just invoked the worksheet filter functions from code and did it that way. Can I do it that way with a larger data set, or is there a way to filter an array the way you would a dataset on a sheet?
Thanks
As already mentioned by everyone, Excel 2007 will take you to a million rows, but it's slower than the Excel 2003 that I presume you're using at the moment, so filtering with it wouldn't be advisable.
Along with MySQL, MS Access is also an option.
You really should put that data in an Access table and use Excel's Database Query to do the job. Since it can also filter retrieved data based on a cell value, it's a great combination.
Storing the data in a database brings you another interesting option (depending on what you want to do): to query your database using PowerPivot.
Although using a relational DB would be preferable in many ways, if you don't have any formulas then filtering your data (1 million rows by 80 columns) using Excel will be reasonably fast (< 1 or 2 seconds depending on what sort of filtering you want to do, which will probably be faster than an un-indexed DB table) assuming that you have enough RAM. If you do have any formulas then you will probably need to be in Manual calculation mode to avoid the filtering process triggering multiple recalculations.
Alright, I have little to no knowledge of the SQL language, and am wondering what the possible reasons are for the slowness of two WITHs vs one WITH in UniData.
Database has around ~1 million rows.
I.e.:
SELECT somewhere WITH Column1 = "str" AND WITH Column2 = "Int" (takes about 5 minutes)
compared to
SELECT somewhere WITH Column1 = "str" (takes about 1 second)
somewhere is indexed (from my knowledge)
so is there anything I'm doing wrong?
If more information is required just ask, not sure what to supply.
Also, what's the difference between WITH and WHERE?
This isn't SQL, it is UniQuery.
To clarify it for you, you can't index the file (somewhere, in this case), only the columns of the file. You might find Column1 is indexed and Column2 is not. Type in LIST.INDEX somewhere to find out what columns have been indexed.
For your question, you have only compared selecting on Column1 against selecting on Column1 & Column2, and assumed the vastly slower response is purely because you selected on 2 columns. Your next test should have been to select only on Column2 and see how slow that was.
There are many possible reasons to explain the difference in response, aside from indexing. In UniData, columns are defined as 'dictionary items'. There are different types of dictionary items. The most basic is a D-type dictionary item, which is just a direct reference to a field in the record. Another type is the I or V-type, which is a derived field. The derived field can be as simple as returning a constant or as complex as performing the equivalent of a JOIN with another file and/or some form of complex calculation. From this it should be easy to see that different columns can take vastly different amounts of processing to handle.
Other reasons are how deep in the record the column is (first field references will be faster than fields later in the record) as well as potential query caching that can affect the timings of your SELECTs.
For more information, check out the database's manuals at Rocket Software.
A single column SELECT on an indexed field will not even require that any data file records are read. If you look under the hood, you'll see that the index file is a normal hash file, and the single column SELECT will simply mean that the index file record with the key "str" is read. This could return thousands and thousands of keys in less than a second.
Once you add the second column, you are probably forcing the system to read all of those thousands and thousands of records, EVEN IF THE SECOND COLUMN IS INDEXED. This is going to take measurably more time.
In general, an index on a field with a small number of unique values is of dubious use. If the second column contains data that has a large number of possible values, leading to a smaller number of records with each particular index value, then it would be best to arrange the SELECT such that the index used is on the second column. I'm not sure, but it might be possible to simply reverse the order of the columns in the SELECT statement to do this. Otherwise you might need to run two SELECT statements back to back.
As an example, assume that the file has 600,000 records with Column1 = "str", and 2,000 records with Column2 = "int":
>SELECT somewhere WITH Column2 = "int"
>>SELECT somewhere WITH Column1 = "str"
The first SELECT resolves against the index on Column2 and returns 2,000 keys; the second then only has to read those 2,000 records, so the pair should return almost instantly.
If the combination of Column1 and Column2 is something that you'll be SELECTing on frequently, then you might want to create a new dictionary item that combines the two, and build an index on that.
That being said, it shouldn't take a U2 system 5 minutes to run through a file of a million records. There's a very good chance that the file has become badly overflowed, and needs to be resized with a larger modulo to improve performance.
I am relatively new to columnar databases, please forgive my ignorance. Let's say I have 1,000,000 columns. I would like to return a sample of 10% of those columns (i.e. every 10th one: c0, c10, c20, ..., c999980, c999990).
In HBase they have column filters; I could write a column filter that returned every tenth result. Can I do this in Pycassa/Cassandra?
Thank you
The only thing you can do server side is slices. So you can read starting at column=c10 with limit=10 to get columns 10-19. Or you can ask for specific columns, so you could ask for every 10th column manually if you knew how many columns there were.
You could do this easily client-side with Pycassa, but Cassandra does not support server-side filtering.