I am using spark-sql 2.4.1, datastax-java-cassandra-connector_2.11-2.4.1.jar and Java 8.
I have a Cassandra table like:
CREATE TABLE company (company_id int PRIMARY KEY, company_name text);
My JavaBean is as below:
@Table(name = "company")
class CompanyRecord {

    @PartitionKey(0)
    @Column(name = "company_id")
    Integer companyId;

    @Column(name = "company_name")
    String companyName;

    // getters and setters
    // default & parameterized constructors
}
I have the Spark code below to save the data into the Cassandra table.
Dataset<Row> latestUpdatedDs = joinUpdatedRecordsDs.select("company_id", "company_name"); // selected from another source, such as an xls sheet
Encoder<CompanyRecord> companyEncoder = Encoders.bean(CompanyRecord.class);
Dataset<CompanyRecord> inputDs = latestUpdatedDs.as(companyEncoder);
inputDs
.write()
.format("org.apache.spark.sql.cassandra")
.option("table","company")
.option("keyspace", "ks_one")
.mode(SaveMode.Append)
.save();
It gives an error like the one below:
ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 176, Column 75: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 176, Column 75: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
Exception in thread "main" java.util.NoSuchElementException: Columns not found in table ks_one.company:
companyId, companyName
at com.datastax.spark.connector.SomeColumns.selectFrom(ColumnSelector.scala:44)
at com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:385)
Question:
Even though I am using annotations for the mapping, why am I getting this error?
How can I fix this without changing the Java Bean field names (i.e. from companyId to company_id)?
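One possible workaround to sketch here (an assumption about this setup, not a verified fix): the connector's DataFrame writer matches Dataset columns to table columns by name and does not consult the driver mapper annotations, so the columns of the Dataset that is actually written can be renamed to the Cassandra names while the bean fields stay untouched, for example:

// sketch only: rename the DataFrame columns to the Cassandra column names before writing
Dataset<Row> toWrite = inputDs.toDF()
        .withColumnRenamed("companyId", "company_id")
        .withColumnRenamed("companyName", "company_name");
toWrite.write()
        .format("org.apache.spark.sql.cassandra")
        .option("table", "company")
        .option("keyspace", "ks_one")
        .mode(SaveMode.Append)
        .save();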
I have a DataSourceV2Relation object and I would like to get the name of its table from the Spark catalog. spark.catalog.listTables() will list all the tables, but is there a way to get the specific table directly from the object?
PySpark
from pyspark.sql.catalog import Table

def get_t_object_by_name(t_name: str, db_name='default') -> Table:
    catalog_t_list = spark.catalog.listTables(db_name)
    return next((t for t in catalog_t_list if t.name == t_name), None)
Call it as: get_t_object_by_name('my_table_name')
Example result: Table(name='my_table_name', database='default', description=None, tableType='MANAGED', isTemporary=False)
Table class definition: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/catalog/Table.html
I have a Spring Boot Java application that talks to Cassandra.
However, one of my queries is failing.
public class ParameterisedListItemRepository {

    private Session session;
    private PreparedStatement findByIds;

    public ParameterisedListItemRepository(Session session, Validator validator, ParameterisedListMsisdnRepository parameterisedListMsisdnRepository) {
        this.session = session;
        this.findByIds = session.prepare("SELECT * FROM mep_parameterisedListItem WHERE id IN ( :ids )");
    }

    public List<ParameterisedListItem> findAll(List<UUID> ids) {
        List<ParameterisedListItem> parameterisedListItemList = new ArrayList<>();
        BoundStatement stmt = this.findByIds.bind();
        stmt.setList("ids", ids);
        session.execute(stmt)
                .all()
                .stream()
                .map(parameterisedListItemMapper) // row-to-entity mapper defined elsewhere
                .forEach(parameterisedListItemList::add);
        return parameterisedListItemList;
    }
}
The following is the stack trace:
java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.util.UUID
at com.datastax.driver.core.TypeCodec$AbstractUUIDCodec.serialize(TypeCodec.java:1626)
at com.datastax.driver.core.AbstractData.setList(AbstractData.java:358)
at com.datastax.driver.core.AbstractData.setList(AbstractData.java:374)
at com.datastax.driver.core.BoundStatement.setList(BoundStatement.java:681)
at com.openmind.primecast.repository.ParameterisedListItemRepository.findAll(ParameterisedListItemRepository.java:128)
at com.openmind.primecast.repository.ParameterisedListItemRepository$$FastClassBySpringCGLIB$$46ffc15e.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:738)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:673)
at com.openmind.primecast.repository.ParameterisedListItemRepository$$EnhancerBySpringCGLIB$$b2db3c41.findAll(<generated>)
at com.openmind.primecast.service.impl.ParameterisedListItemServiceImpl.findByParameterisedList(ParameterisedListItemServiceImpl.java:102)
at com.openmind.primecast.web.rest.ParameterisedListItemResource.getParameterisedListItemsByParameterisedList(ParameterisedListItemResource.java:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Any idea what is going wrong? I know this query is the problem:
SELECT * FROM mep_parameterisedListItem WHERE id IN ( :ids )
Any idea how I can change the findAll function so the query works?
This is the table definition:
CREATE TABLE "Openmind".mep_parameterisedlistitem (
id uuid PRIMARY KEY,
data text,
msisdn text,
ordernumber int,
parameterisedlist uuid
) WITH COMPACT STORAGE;
Thank you.
Without knowing the table schema, my guess is that a change was made to the table, so the schema no longer matches the bindings in the prepared statement.
A big part of the problem is your query with SELECT *. Our recommendation for best practice is to explicitly name all the columns you're retrieving from the table. By specifying the columns in your query, you avoid surprises when the table schema changes.
In this instance, either a new column was added or an old column was dropped. With the cached prepared statement, it was expecting one column type and got another -- the ArrayList doesn't match UUID.
The solution is to re-prepare the statement and name all the columns. Cheers!
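A minimal sketch of that recommendation, assuming the column names from the table definition above and the repository code from the question (re-prepare after any schema change):

// sketch: name the columns explicitly instead of SELECT *
findByIds = session.prepare(
        "SELECT id, data, msisdn, ordernumber, parameterisedlist "
      + "FROM mep_parameterisedlistitem WHERE id IN ( :ids )");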
This is my dataset:
Dataset<Row> myResult = pot.select(col("number")
, col("document")
, explode(col("mask")).as("mask"));
I now need to create a new dataset from the existing myResult, something like below:
Dataset<Row> myResultNew = myResult.select(col("number")
, col("name")
, col("age")
, col("class")
, col("mask");
name, age and class are created from the column document of the Dataset myResult.
I guess I can call functions on the column document and then perform any operation on it.
myResult.select(extract(col("document")));
private String extract(final Column document) {
//TODO ADD A NEW COLUMN nam, age, class TO THE NEW DATASET.
// PARSE DOCUMENT AND GET THEM.
XMLParser doc= (XMLParser) document // this doesnt work???????
}
My question is: document is of type Column and I need to convert it into a different object type and parse it to extract name, age and class. How can I do that? The document is XML and I need to parse it to get the other 3 columns, so I can't avoid converting it to XML.
Converting the extract method into a UDF would be a solution that is as close as possible to what you are asking. A UDF can take the value of one or more columns and execute any logic with this input.
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;
[...]
UserDefinedFunction extract = udf(
(String document) -> {
List<String> result = new ArrayList<>();
XMLParser doc = XMLParser.parse(document);
String name = ... //read name from xml document
String age = ... //read age from xml document
String clazz = ... //read class from xml document
result.add(name);
result.add(age);
result.add(clazz);
return result;
}, DataTypes.createArrayType(DataTypes.StringType)
);
A restriction of UDFs is that they can only return one column. Therefore, the function returns a String array that has to be unpacked afterwards.
Dataset<Row> myResultNew = myResult
.withColumn("extract", extract.apply(col("document"))) //1
.withColumn("name", col("extract").getItem(0)) //2
.withColumn("age", col("extract").getItem(1)) //2
.withColumn("class", col("extract").getItem(2)) //2
.drop("document", "extract"); //3
1. Call the UDF and use the column that contains the XML document as the parameter of the apply function.
2. Create the result columns out of the array returned in step 1.
3. Drop the intermediate columns.
Note: the UDF is executed once per row in the dataset. If the creation of the XML parser is expensive, this might slow down the execution of the Spark job, as one parser is instantiated per row. Due to the parallel nature of Spark, it is not possible to reuse the parser for the next row. If this is an issue, another (at least in the Java world slightly more complex) option would be to use mapPartitions, as sketched below. Here one would not need one parser per row but only one parser per partition of the dataset.
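A rough sketch of that mapPartitions variant, reusing the hypothetical XMLParser from the question and assuming a simple Person bean with name, age and clazz fields (both are placeholders, not real APIs):

import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Encoders;
[...]
Dataset<Person> parsed = myResult.mapPartitions(
    (MapPartitionsFunction<Row, Person>) rows -> {
        XMLParser parser = new XMLParser();   // one parser per partition instead of one per row
        List<Person> out = new ArrayList<>();
        while (rows.hasNext()) {
            String document = rows.next().getAs("document");
            Person p = new Person();
            // read name, age and class from the parsed document (placeholder logic)
            out.add(p);
        }
        return out.iterator();
    },
    Encoders.bean(Person.class));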
A completely different approach would be to use spark-xml.
We have written a Spark batch application (Spark version 2.3.0). The code is as below.
Transformation: Dataset<CollectionFlattenedData> collectionDataDS = flatMap(a function which parses some files and returns the dataset); This dataset has three types of data, differentiated by the column recordtype: 1, 2, 3.
Load to temp table: collectionDataDS.createOrReplaceTempView(TEMP_TABLE); Creates a temp view of the dataset.
Action 1: sparkSession.sql("INSERT INTO TABLE1 SELECT COL1,COL2,COL3 FROM TEMP_TABLE WHERE recordtype='1'"); Hive query to load TABLE1 from the temp table.
Action 2: sparkSession.sql("INSERT INTO TABLE2 SELECT COL4,COL5,COL6 FROM TEMP_TABLE WHERE recordtype='2'"); Hive query to load TABLE2 from the temp table.
Action 3: sparkSession.sql("INSERT INTO TABLE2 SELECT COL7,COL8,COL9 FROM TEMP_TABLE WHERE recordtype='3'"); Hive query to load the error table.
What is happening: because we are running 3 queries, which are nothing but separate actions, the flatMap transformation is called three times (once per action). But our requirement is that the flatMap operation should be called only once.
The CollectionFlattenedData POJO code is something like this:
public class CollectionFlattenedData implements Serializable {
    private String recordtype;
    private String COL1;
    private String COL2;
    private String COL3;
    private String COL4;
    private String COL5;
    private String COL6;
    private String COL7;
    private String COL8;
    private String COL9;
    // getters and setters of all the columns
}
Is there any way we can do this? An early response is highly appreciated.
We can approach this in two ways, but first identify the size of the TEMP_TABLE.
If the size is on the order of your RAM, i.e. if a good amount of your temp table can be cached, then you can cache it and use it in the further calculations. (You can get the data size from the Spark UI.)
The other, better way is to just save the data into a permanent table.
You can then refer to it in the next steps as usual.
When you use .createOrReplaceTempView(), you are only giving the dataset a name to use further in your Spark SQL queries. It does not trigger any action on the resultant dataframe.
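A minimal sketch of the caching option, reusing the names from the question (the storage level is just an example):

import org.apache.spark.storage.StorageLevel;
[...]
// persist so the flatMap result is materialized once and reused by all three actions
collectionDataDS.persist(StorageLevel.MEMORY_AND_DISK());
collectionDataDS.createOrReplaceTempView(TEMP_TABLE);
// run the three INSERT ... SELECT queries as before
collectionDataDS.unpersist();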
My Code:
finalJoined.show();
Encoder<Row> rowEncoder = Encoders.bean(Row.class);
Dataset<Row> validatedDS = finalJoined.map(row -> validationRowMap(row), rowEncoder);
validatedDS.show();
Map function:
public static Row validationRowMap(Row row) {
    // PART-A validateTxn()
    System.out.println("Inside map");
    // System.out.println("Value of CIS_DIVISION is " + row.getString(7));
    // 1. CIS_DIVISION
    if ((row.getString(7)) == null || (row.getString(7)).trim().isEmpty()) {
        System.out.println("CIS_DIVISION cannot be blank.");
    }
    return row;
}
Output:
The finalJoined Dataset<Row> is shown properly, with all columns and rows holding proper values; however, the validatedDS Dataset<Row> is shown with only one column with empty values.
Expected output:
validatedDS should show the same values as the finalJoined dataset, because I am only performing validation inside the map function and not changing the dataset itself.
Please let me know if you need more information.
Encoders.bean is intended for use with bean classes. Row is not one of these (it doesn't define setters and getters for specific fields, only generic getters).
To return a Row object you have to use RowEncoder and provide the expected output schema.
Check, for example, Encoder for Row Type Spark Datasets.
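For reference, a minimal sketch of that approach with the names from the question, reusing the schema of the input dataset:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
[...]
// RowEncoder keeps the full schema of finalJoined, so all columns survive the map
Encoder<Row> rowEncoder = RowEncoder.apply(finalJoined.schema());
Dataset<Row> validatedDS = finalJoined.map(
        (MapFunction<Row, Row>) row -> validationRowMap(row), rowEncoder);
validatedDS.show();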