Row identifier in sparklyr ml_kmeans - apache-spark

I was looking for a way to keep a row identifier attached to the output (specifically the cluster assignment) when performing cluster analysis with the sparklyr ml_kmeans function.
My workaround so far has been simply:
dropping the row identifier from the feature set,
running ml_kmeans on the remaining data, and then
using sdf_bind_cols to combine the row id with the cluster assignment.
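To illustrate, here is the pattern sketched in plain Python (Spark isn't available here, so a stub function stands in for ml_kmeans, and the data is made up):

```python
# Illustration of the workaround above in plain Python (hypothetical data;
# assign_clusters is a stand-in for ml_kmeans, which needs a Spark session).
rows = [
    {"id": "a", "x": 1.0, "y": 1.1},
    {"id": "b", "x": 1.2, "y": 0.9},
    {"id": "c", "x": 8.0, "y": 8.2},
]

# Step 1: drop the row identifier from the feature set.
ids = [r["id"] for r in rows]
features = [(r["x"], r["y"]) for r in rows]

# Step 2: cluster the remaining data (stub: assign each point to the
# nearest of two fixed centers).
def assign_clusters(points, centers=((1.0, 1.0), (8.0, 8.0))):
    def dist2(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
    return [min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
            for p in points]

clusters = assign_clusters(features)

# Step 3: bind the row ids back onto the cluster assignments
# (the plain-Python analogue of sdf_bind_cols).
assigned = list(zip(ids, clusters))
print(assigned)  # [('a', 0), ('b', 0), ('c', 1)]
```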
I thought, however, that it should be possible to hold onto the row identifier and avoid having to bind the columns back on afterwards.
I've checked the documentation but can't seem to find a way to do this. Any suggestions would be greatly appreciated.
Thanks,
Joe

Related

Assigning indexes across rows within python DataFrame

I'm currently trying to assign a unique index across rows, rather than down columns. The main constraint is that these values can never repeat, and they must be preserved with every monthly report that I run.
I've thought about merging the columns and assigning an index to that, but my concern is that with this method I won't be able to easily modify the dataframe and still preserve the same index values for each cell.
I'm expecting my df to look something like this below:
[image: Sample DataFrame]
I haven't yet found a viable solution so haven't got any code to show yet. Any solutions would be much appreciated. Thank you.
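To illustrate what I mean by preserving indexes between runs, here is a rough stdlib-only sketch (the column names are made up; the key idea is persisting a key-to-id mapping across monthly reports):

```python
# Sketch: assign each row a permanent unique id based on a key built from
# its identifying columns, persisting the mapping between monthly runs.
# Column names ("account") are hypothetical.
def assign_ids(rows, key_cols, registry):
    """registry maps a row key -> its permanent id; new keys get new ids."""
    next_id = max(registry.values(), default=0) + 1
    out = []
    for row in rows:
        key = "|".join(str(row[c]) for c in key_cols)
        if key not in registry:
            registry[key] = next_id
            next_id += 1
        out.append({**row, "row_id": registry[key]})
    return out

registry = {}  # in practice, load/save this from disk between runs
jan = assign_ids([{"account": "A"}, {"account": "B"}], ["account"], registry)
feb = assign_ids([{"account": "B"}, {"account": "C"}], ["account"], registry)
print([r["row_id"] for r in jan])  # [1, 2]
print([r["row_id"] for r in feb])  # [2, 3]  ("B" keeps its id)
```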

How can I update a column to a particular value in a Cassandra table?

Hi, I have a Cassandra table with around 200 records in it. I later altered the table to add a new column named budget, which is of type boolean. I want to set the default value for that column to true. What should the CQL look like?
I tried the following command, but it didn't work:
cqlsh:Openmind> update mep_primecastaccount set budget = true ;
SyntaxException: line 1:46 mismatched input ';' expecting K_WHERE
Appreciate any help.
Thank you.
Any operation that would require a cluster-wide read before write is not supported (it won't work at the scale Cassandra is designed for). You must provide a partition and clustering key for an update statement. If there are only 200 records, a quick Python script can do this for you: do a SELECT * FROM mep_primecastaccount and iterate through the ResultSet, issuing an update for each row. If you have a lot more records you might want to use Spark or Hadoop, but for a small table like that a quick script can do it.
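Here is a rough sketch of such a script (assuming, purely for illustration, that the table's partition key is a column named id; a real version would execute the statements through the DataStax Python driver instead of just building the CQL strings):

```python
# Sketch of the per-row update script described above. The partition key
# column name ("id") is an assumption; substitute your table's actual key.
# A real script would run these via the DataStax Python driver rather than
# just printing them.
result_set = [{"id": 1}, {"id": 2}, {"id": 3}]  # stands in for SELECT * output

statements = [
    f"UPDATE mep_primecastaccount SET budget = true WHERE id = {row['id']};"
    for row in result_set
]
for stmt in statements:
    print(stmt)
```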
Chris's answer is correct - there is no efficient or reliable way to modify a column value for each and every row in the database. But for a 200-row table that doesn't change in parallel, it's actually very easy to do.
But there's another way that can work also on a table of billions of rows:
You can handle the notion of a "default value" in your client code. Pre-existing rows will not have a value for "budget" at all: it will be neither true nor false, but outright missing (a.k.a. "null"). Your client code may, when it reads a row with a missing "budget" value, replace it with some default value of its choice, e.g. "true".
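A minimal sketch of that client-side default, assuming rows arrive as dicts where a missing key stands in for a null column:

```python
# Treat a missing (null) budget column as true when reading, so old rows
# behave as if the default had been written.
def effective_budget(row, default=True):
    value = row.get("budget")
    return default if value is None else value

print(effective_budget({"id": 1}))                   # True (column never written)
print(effective_budget({"id": 2, "budget": False}))  # False
```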

Sequence number equivalent in Sybase ASE

I have an existing Sybase ASE table which uses IDENTITY as its primary key. Now I need to recreate this table, but I want the PK to start from the next value of the IDENTITY PK in the prod environment. E.g. if currently PK = 231, then after re-creating I want it to start from 232 onwards, or any other integer value > 231.
In Oracle it's easy to configure a sequence and give it a START WITH value, but in Sybase ASE we don't have sequences available, so I tried the newid() function; however, it gives binary(16) values whereas I want integer values.
Can anyone suggest something ?
I am planning to use something like the query below, and I think it will resolve my problem. Let me know if anyone has a better solution.
select abs(hextoint(newid()))
Any thoughts on this solution? Can it ever generate the same number it has already generated?
select next_identity('tablename') will return the identity value of the next insert for a table with an identity column, so you know which ID will be allocated next.
Selecting @@identity immediately after an insert will return the ID which was just given to the row inserted.
However, you need to be careful: identity columns are not the same as sequences and should not be relied upon if you want a sequence with no gaps, because you will get a gap (albeit a small one sometimes) if the database crashes or is shut down with nowait. For that, a number-fountain / insert-trigger style of ID generation is a better option. Using 'identity_insert' is really only for when you want to bulk-load a whole table; you should not be setting it with every insert or you will defeat the whole purpose of an identity column, which is fast generation of new key values.
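On the asker's collision question: yes, it can repeat. If the UUID is reduced to roughly a 32-bit integer, the birthday bound makes repeats likely surprisingly fast. A quick sketch of that bound (assuming the resulting values are effectively uniform over about 2^31 positive ints):

```python
import math

# Birthday bound: probability of at least one repeat after n draws from a
# space of N equally likely values is roughly 1 - exp(-n*(n-1)/(2*N)).
def collision_probability(n, space):
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * space))

space = 2 ** 31  # abs() of a 32-bit int: ~2.1 billion distinct values
for n in (10_000, 100_000, 1_000_000):
    print(n, round(collision_probability(n, space), 4))
```

Roughly: at 100,000 rows the chance of a repeated key is already around 90%, which is why the identity-based answers above are the safer route.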

Astyanax retrieve by UUID

How do we retrieve a row in Cassandra using Astyanax?
I have a web application which requires pagination to be done on the server side; the DB is Cassandra. The row key is a UUID and I have a few columns within a row, so I am trying to paginate on the row keys.
I have put together a solution I am not completely happy with. The issue is that when I do my first search, based on the search filter I get from the UI, I don't know the UUID of the first row. So I prepare a query which gives me the first 6 records, store the key of the 6th record in a map, and put it in the session; when the user requests the second page from the UI, I retrieve this key (UUID) and use it as the start for the next set of records to be retrieved. I was hoping to find a cleaner approach.
EDIT in response to question changes...
In that case, it sounds like you are doing it just fine. It also sounds like you are using an order-preserving partitioner (OPP), as otherwise the rows would not be in order. While playOrm's solution is more elegant, returning you a cursor that you store in the session, I think what you have is just fine.
EDIT since our code changed
Line 74 in this link is how we do it (you can drill down into that cursor class to see what we do):
https://github.com/deanhiller/playorm/blob/master/src/main/java/com/alvazan/orm/layer9z/spi/db/cassandra/CassandraSession.java
If you have 10 row keys, you just pass in the list of keys.
I am not sure what you mean by pagination. Do you have a from and a to row key and want the rows between them? Are you ordering the cluster by row key, then?
Another completely different direction for ordering is playOrm, which can do S-SQL, and if things are partitioned you can do joins and such (the link above is to one of the files in playOrm, actually).
I am really not sure what you mean by "you don't have the row key with you".
later,
Dean
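The "remember the last key" pattern the asker describes can be sketched like this (a sorted Python list stands in for row keys returned by an order-preserving store; page size 6 as in the question):

```python
# Sketch of key-based pagination: fetch a page, remember the last key seen,
# and start the next page strictly after it.
def fetch_page(keys, after=None, page_size=6):
    start = 0 if after is None else keys.index(after) + 1
    return keys[start:start + page_size]

keys = [f"uuid-{i:02d}" for i in range(14)]  # stand-in for ordered row keys

page1 = fetch_page(keys)                # first 6 keys
cursor = page1[-1]                      # store this in the session
page2 = fetch_page(keys, after=cursor)  # next 6 keys
print(page1[-1], page2[0])  # uuid-05 uuid-06
```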

Can you nest Excel data tables?

I have an Excel workbook that utilises a data table (A).
I now want to create another data table (B) that effectively sits on top of the other data table. That is, each "iteration" of B calls A.
This approach fails, although I cannot find anything in the documentation about data tables that indicates it would not work.
Basically I'd like to know if anyone has tried this before and whether I am missing something?
Is there a workaround? Do you know of any documentation that spells out whether and why this is not supported?
No.
I tried this at length some years ago in both xl03 and xl07, and my conclusion was that it can't be done: each data table seems to be an independent one-off run; they don't talk to each other if you try to link them.
I couldn't find any documentation on this issue either, whether on the process or from anyone else looking at a similar problem.
I want to share my experience using data tables.
We have found a workaround for this problem.
Say you have two variables A & B that need to run through a data table to get one or more results.
What we did is: assign an id to every combination (binary combination) of A & B (e.g. A=0 & B=0 => id=1).
You then run one data table with a length of A*B.
The drawback here is the time it takes to calculate the data (7 min with 25 data tables & 2 data tables with a length of 8000 rows).
Hope it helps!
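That flattening workaround can be sketched like this (hypothetical inputs; itertools.product enumerates every combination of A & B and each combination gets an id, giving one flat table instead of nested ones):

```python
import itertools

# Flatten two input variables into a single table of combinations, each with
# an id, so one data table of length len(A) * len(B) replaces nested tables.
A = [0, 1]
B = [0, 1, 2]

table = [
    {"id": i + 1, "A": a, "B": b}
    for i, (a, b) in enumerate(itertools.product(A, B))
]
print(len(table))  # 6
print(table[0])    # {'id': 1, 'A': 0, 'B': 0}
```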
