Slick: Odd error when using .firstOption.get | [SlickException: Read NULL value (null) for ResultSet column Path s2._5]

So I have the following code used as a validation method:
if (TableQuery[UsersTable].filter(_.name === login).exists.run) {
  val id = TableQuery[UsersTable].filter(_.name === login).firstOption.get.id
  val name = TableQuery[UsersTable].filter(_.id === id).firstOption.get.name
}
In case you're wondering, I check .exists before I run the next two queries because the login value can match either of two columns in the database.
Anyway, I get [SlickException: Read NULL value (null) for ResultSet column Path s2._5] when attempting to get the id above, and I'm unsure why. There should be a firstOption there because the code has already validated that a row exists for the criteria given above. No "id" column values are null.
How can I get this id value working correctly?

One of the involved columns is nullable in the database, but you didn't declare it as Option[...] in your Table definition, so Slick throws when it reads a NULL into the non-optional column.
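For example, in a Slick 2.x table definition (a rough sketch assuming Slick 2.x and Postgres; the real_name column is an invented stand-in for whichever of your columns is actually nullable), a database-nullable column has to be mapped as column[Option[...]]:
import scala.slick.driver.PostgresDriver.simple._

case class User(id: Int, name: String, realName: Option[String])

class UsersTable(tag: Tag) extends Table[User](tag, "users") {
  def id       = column[Int]("id", O.PrimaryKey, O.AutoInc)
  def name     = column[String]("name")               // NOT NULL in the database
  def realName = column[Option[String]]("real_name")  // nullable: must be Option[...]
  def *        = (id, name, realName) <> (User.tupled, User.unapply)
}
With a non-Option mapping, reading a row whose nullable column happens to be NULL fails with exactly this "Read NULL value" exception, even though the id column itself is never null.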

Related

Problem with selecting json object in Pyspark, which may sometimes have Null values

I have this big nested JSON object from which I need to make a DataFrame. One of the inner JSON elements sometimes comes in empty and sometimes comes with values in it.
I am giving a simple example here:
When it is filled:
{"student_address": {"Door Number":"1234",
"Place":"xxxx",
"Zip code":"12345"}}
When it is empty:
{"student_address":""}
So, in the final DataFrame I should have all three columns: Door Number, Place and Zip code. When the address is empty, I should put null values in the respective columns, and fill them when there is data.
The code I tried:
test = test.withColumn("place",when(col("student_address") == "", lit(None)).otherwise(col("student_address.place")))\
.withColumn("door_num",when(col("student_address") == "",lit(None)).otherwise(col("student_address.door_num")))\
.withColumn("zip_code",when(col("student_address") == "", lit(None)).otherwise(col("student_address.zip_code")))
So, I am trying to check whether the value is empty or not.
This is the error I am getting:
AnalysisException: Can't extract value from student_address#34: need struct type but got string
I am not able to understand why PySpark evaluates the expression in otherwise even when the condition in when is met. (I tried giving simple values in otherwise instead of the JSON path and it worked.)
I am struggling to understand what is happening here and would like to know if there is any simple way to do this.
val addressSchema = StructType(StructField("Place", StringType, false) :: Nil) // add the other fields the same way
val schema = StructType(StructField("student_address", addressSchema, true) :: Nil) // the point is that student_address is nullable
val df = spark.read.schema(schema).json("example.json")
Alternatively, keep student_address as a plain string column and use get_json_object: the first argument is the student_address column and the second is the JSON path of the field you want inside it (assuming the column holds just the inner address object).
df.withColumn("place", when(df.student_address == "", lit(None)).otherwise(get_json_object(col("student_address"), "$.Place")))
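For completeness, here is a rough end-to-end sketch of the schema-based approach in Scala (the question uses PySpark, but the API is the same; spark is the SparkSession, and example.json and the field names are taken from the sample above). When student_address is read against a nullable struct schema, an empty-string value comes back as a null struct (the exact behavior varies a little across Spark versions and parse modes), so the extracted columns are simply null and no when/otherwise is needed:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// nullable struct for the inner address object
val addressSchema = StructType(Seq(
  StructField("Door Number", StringType, nullable = true),
  StructField("Place",       StringType, nullable = true),
  StructField("Zip code",    StringType, nullable = true)))

val schema = StructType(Seq(
  StructField("student_address", addressSchema, nullable = true)))

val df = spark.read.schema(schema).json("example.json")

// empty addresses yield a null struct, so these columns are just null for them
val result = df
  .withColumn("place",    col("student_address.Place"))
  .withColumn("door_num", col("student_address.`Door Number`"))
  .withColumn("zip_code", col("student_address.`Zip code`"))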

compare and extract number from string in jpql

I have a column named record_number of type varchar with the following format: [currentYear]-[Number], e.g. 2015-11.
I need to find the maximum number in this column; i.e. if the row holding the maximum is 2015-15 then the value should be 15, whereas if the column has a value of 2016-2, then the max should be 2.
How can I do it in JPQL?
I'm using Postgres and EJB 3.1
You can use the SUBSTRING function of JPQL:
select table From Table table order by SUBSTRING(table.record_number, 5) desc;
To get only the first result, use setMaxResults, like this:
em.createQuery("select table From Table table order by SUBSTRING(table.record_number, 5) desc")
    .setMaxResults(1) // only the first result
    .getResultList();
I managed to fix the problem based on the comment of Dherik:
I used the following query to get the object that holds the correct value, which seems more optimized than the one proposed by Dherik:
final TypedQuery<Table> query = em.createQuery(
    "from Table t where t.recordNumber = (select max(t2.recordNumber) from Table t2)", Table.class);
Table t = null;
try {
    t = query.getSingleResult();
} catch (Exception e) {
    // handle the exception here
}
return t;
The trick: since it's my app that creates the record number, I changed the method that creates it so that the number is formatted on 2 digits, to avoid a wrong string comparison (the case where '9' is considered greater than '10').
// format numbers < 10 so that they are on 2 digits
final String formattedNumber = String.format("%02d", number);
final int year = SomeUtilClass.getYearFromDate(new Date());
return new StringBuilder().append(year).append("-").append(formattedNumber).toString();

Spark DataFrame created from JavaRDD<Row> copies all columns data into first column

I have a DataFrame which I need to convert into JavaRDD<Row> and back to a DataFrame. I have the following code:
DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file");
// I do an order by on sourceFrame above and then convert it into a JavaRDD
JavaRDD<Row> modifiedRDD = sourceFrame.toJavaRDD().map(new Function<Row, Row>() {
    public Row call(Row row) throws Exception {
        if (row != null) {
            // updated row by creating new Row
            return RowFactory.create(updateRow);
        }
        return null;
    }
});
// now I convert the above JavaRDD<Row> back into a DataFrame using the following
DataFrame modifiedFrame = sqlContext.createDataFrame(modifiedRDD, schema);
sourceFrame and modifiedFrame have the same schema. When I call sourceFrame.show() the output is as expected: every column has its corresponding values and no column is empty. But when I call modifiedFrame.show(), all the column values get merged into the first column. For example, assume the source DataFrame has 3 columns as shown below:
_col1 _col2 _col3
ABC 10 DEF
GHI 20 JKL
When I print modifiedFrame, which I converted from the JavaRDD, it shows the following:
_col1 _col2 _col3
ABC,10,DEF
GHI,20,JKL
As shown above, _col1 holds all the values while _col2 and _col3 are empty. I don't know what is wrong.
As I mentioned in the question's comments, it probably occurs because you pass the list as a single parameter:
return RowFactory.create(updateRow);
Looking at the Apache Spark docs and source code: in the "programmatically specifying the schema" example, the parameters are passed one by one, one per column. A rough look at the RowFactory.java and GenericRow sources shows that a single list parameter is not unpacked into separate columns. So try to pass the values for the row's columns individually:
return RowFactory.create(updateRow.get(0), updateRow.get(1), updateRow.get(2)); // list example
You may also convert your list to an array and then pass it as a parameter:
YourObject[] updatedRowArray = new YourObject[updateRow.size()];
updateRow.toArray(updatedRowArray);
return RowFactory.create(updatedRowArray);
By the way, RowFactory.create() is what creates Row objects. From the Apache Spark documentation for the Row object and the RowFactory.create() method:
Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access. It is invalid to use the native primitive interface to retrieve a value that is null; instead a user must check isNullAt before attempting to retrieve a value that might be null.
To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.
A Row object can be constructed by providing field values. Example:
import org.apache.spark.sql._
// Create a Row from values.
Row(value1, value2, value3, ...)
// Create a Row from a Seq of values.
Row.fromSeq(Seq(value1, value2, ...))
According to the documentation, you can also apply whatever logic you need to separate the row's columns while creating the Row objects. But I think converting the list to an array and passing it as a parameter will work for you (I couldn't try it, please post your feedback, thanks).
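To illustrate with the Scala API quoted above (a toy example with made-up values, not the asker's actual data): wrapping the collection itself produces a one-column row, which is exactly the "everything ends up in _col1" symptom, while Row.fromSeq (or spreading the values) produces one column per element:
import org.apache.spark.sql.Row

val values = Seq("ABC", 10, "DEF")

val wrong = Row(values)          // 1 column: the whole Seq lands in the first field
val right = Row.fromSeq(values)  // 3 columns: "ABC", 10, "DEF"
val same  = Row(values: _*)      // equivalent: spread the Seq into the varargs
Java's RowFactory.create(Object... values) behaves the same way, which is why converting the list to an array (so the varargs can spread it) fixes the problem.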

Cassandra DataStax driver: how to page through columns

I have wide rows with timestamp columns. If I use the DataStax Java driver, I can page row results by using LIMIT or FETCH_SIZE; however, I could not find any specifics on how to page through the columns of a specific row.
I found this post: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/CQL-3-and-wide-rows-td7594577.html
which explains how I could get ranges of columns based on the column name (timestamp) values.
However, what I need is to get ALL columns. I just don't want to load them all into memory at once, but rather "stream" the results and process a chunk of columns (preferably of a controllable size) at a time until all columns of the row are processed.
Does the DataStax driver support streaming of this kind, and if so, what is the syntax for using it?
Additional clarification:
Essentially, what I'm looking for is an equivalent of Hector's ColumnSliceIterator, with which I could iterate over all columns (up to Integer.MAX_VALUE of them) of a specific row in batches of, say, 100 columns at a time, as follows:
SliceQuery sliceQuery = HFactory.createSliceQuery(keySpace, ...);
sliceQuery.setColumnFamily(MY_COLUMN_FAMILY);
sliceQuery.setKey(myRowKey);
// columns to be returned. The null value indicates all columns
sliceQuery.setRange(
    null,              // start column
    null,              // end column
    false,             // reversed order
    Integer.MAX_VALUE  // number of columns to return
);
ColumnSliceIterator iter = new ColumnSliceIterator(
    sliceQuery,  // previously created slice query needs to be passed as parameter
    null,        // starting column name
    null,        // ending column name
    false,       // reverse
    100          // column count <-- the batch size
);
while (iter.hasNext()) {
    String myColumnValue = iter.next().getValue();
}
How do I do the exact same thing using the DataStax driver?
thanks!
Marina
The ResultSet object that you get back is actually set up to do this sort of paging for you by default. Calling one() repeatedly, or iterating using the iterator(), will let you access all the data without pulling it all into memory at once. More details are available in the API docs.
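A rough sketch of what that looks like (written in Scala here, though the driver calls are identical from Java; the contact point, keyspace, table and column names are made up, and this assumes the 2.x Java driver, where automatic paging is available):
import com.datastax.driver.core.{Cluster, SimpleStatement}
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// the fetch size plays the role of Hector's batch size: the driver pulls
// roughly 100 rows per page behind the scenes as you iterate
val stmt = new SimpleStatement(
  "SELECT ts, value FROM my_keyspace.my_table WHERE key = 'my-row-key'")
stmt.setFetchSize(100)

val rs = session.execute(stmt)
for (row <- rs.iterator().asScala) {
  val myColumnValue = row.getString("value") // the next page is fetched transparently
}

cluster.close()
Roughly speaking, each "column" of the old Thrift-style wide row is a CQL row, so paging rows with a fetch size is the CQL equivalent of paging columns with Hector's ColumnSliceIterator.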

Error in Linq: The text data type cannot be selected as DISTINCT because it is not comparable

I have a problem with LINQ. A third-party database that I need to connect to uses the now-deprecated text data type (I can't change this), and I need to execute a Distinct() in my LINQ query on results that contain this field.
I don't want to do a ToList() before executing the Distinct(), as that would bring back thousands of records from the database that I don't need and would annoy the client, since they get charged for bandwidth usage. I only need the first 15 distinct records.
Anyway query is below:
var query = (from s in db.tSearches
             join sc in db.tSearchIndexes on s.GUID equals sc.CPSGUID
             join a in db.tAttributes on sc.AttributeGUID equals a.GUID
             where s.Notes != null && a.Attribute == "Featured"
             select new FeaturedVacancy
             {
                 Id = s.GUID,
                 DateOpened = s.DateOpened,
                 Notes = s.Notes
             });
return query.Distinct().OrderByDescending(x => x.DateOpened);
I know I can do a subquery to achieve the same thing as above (tSearches contains unique records), but I'd prefer a more straightforward solution if one is available, as I need to change a number of similar queries throughout the code to get this working.
No answers on how to do this, so I went with my first suggestion: I retrieved the unique records first from tSearches, then constructed a subquery over the non-unique records and filtered the search results by this subquery. Answer below:
var query = (from s in db.tSearches
             where s.DateClosed == null && s.ConfidentialNotes != null
             orderby s.DateOpened descending
             select new FeaturedVacancy
             {
                 Id = s.GUID,
                 Notes = s.ConfidentialNotes
             });

/* Now filter by our 'Featured' attribute */
var subQuery = from sc in db.tSearchIndexes
               join a in db.tAttributes on sc.AttributeGUID equals a.GUID
               where a.Attribute == "Featured"
               select sc.CPSGUID;

query = query.Where(x => subQuery.Contains(x.Id));
return query;
