I'm using the Azure Kusto Java SDK v2.0.1 with Scala on Java 8.
I'm executing some query:
val query = " ... "
val tenantId = " ... "
val queryResponse = client.execute(tenantId, query)
val queryResponseResults = queryResponse.getPrimaryResults
I eventually want to convert the result to JSON, so I need all the columns, but I can't find anything like a getColumns method.
While debugging I can see that the object (KustoResultSetTable) has the fields columnsAsArray (which is exactly what I want) and columns, but they are private and I didn't find any getters.
A getter will be added in the next version
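In the meantime, a possible stopgap is plain Java reflection against the private field visible in the debugger. This is only a sketch: it assumes the field is really named columnsAsArray and it will break if the SDK renames it, so switch to the official getter as soon as it ships.
// Fragile workaround, not an official API: read the private field seen while debugging.
// The field name `columnsAsArray` is an assumption and may change between SDK versions.
val columnsField = queryResponseResults.getClass.getDeclaredField("columnsAsArray")
columnsField.setAccessible(true)
val columns = columnsField.get(queryResponseResults) // raw Object; inspect and cast as needed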
We have an HDInsight cluster running HBase (Ambari)
We have created a table using Phoenix:
CREATE TABLE IF NOT EXISTS Results (
    Col1 VARCHAR(255) NOT NULL,
    Col2 INTEGER NOT NULL,
    Col3 INTEGER NOT NULL,
    Destination VARCHAR(255) NOT NULL
    CONSTRAINT pk PRIMARY KEY (Col1, Col2, Col3)
) IMMUTABLE_ROWS=true
We have filled some data into this table (using some Java code).
Later, we decided to create a local index on the Destination column, as follows:
CREATE LOCAL INDEX DESTINATION_IDX ON RESULTS (destination) ASYNC
We ran the IndexTool to populate the index, as follows:
hbase org.apache.phoenix.mapreduce.index.IndexTool --data-table RESULTS --index-table DESTINATION_IDX --output-path DESTINATION_IDX_HFILES
When we run queries that filter on the Destination column, everything is OK. For example:
select /*+ NO_CACHE, SKIP_SCAN */ COL1, COL2, COL3, DESTINATION
from Results where COL1='data' and DESTINATION='some value';
But if we do not use DESTINATION in the WHERE clause, we get a NullPointerException in BaseResultIterators (from phoenix-core-4.7.0-HBase-1.1.jar).
This exception is thrown only when we use the new local index. If we query while ignoring the index, like this:
select /*+ NO_CACHE, SKIP_SCAN, NO_INDEX */ COL1, COL2, COL3, DESTINATION
from Results where COL1='data' and DESTINATION='some value';
we do not get the exception.
Here is the relevant code from the area where the exception is thrown:
...
catch (StaleRegionBoundaryCacheException e2) {
    // Catch only to try to recover from region boundary cache being out of date
    if (!clearedCache) { // Clear cache once so that we rejigger job based on new boundaries
        services.clearTableRegionCache(physicalTableName);
        context.getOverallQueryMetrics().cacheRefreshedDueToSplits();
    }
    // Resubmit just this portion of work again
    Scan oldScan = scanPair.getFirst();
    byte[] startKey = oldScan.getAttribute(SCAN_ACTUAL_START_ROW);
    byte[] endKey = oldScan.getStopRow();
    // ==================== Note: isLocalIndex is true here ====================
    if (isLocalIndex) {
        endKey = oldScan.getAttribute(EXPECTED_UPPER_REGION_KEY);
        // endKey is null for some reason at this point, and the next call
        // fails inside it with the NPE
    }
    List<List<Scan>> newNestedScans = this.getParallelScans(startKey, endKey);
We must use this version of the jar since we run inside Azure HDInsight and cannot choose a newer jar version.
Any ideas how to solve this?
What does "recover from region boundary cache being out of date" mean? it seems to be related to the problem
It appears that the phoenix-core version shipped with Azure HDInsight (phoenix-core-4.7.0.2.6.5.3004-13.jar) has the bug, but with a slightly newer version (phoenix-core-4.7.0.2.6.5.8-2.jar, from http://nexus-private.hortonworks.com:8081/nexus/content/repositories/hwxreleases/org/apache/phoenix/phoenix-core/4.7.0.2.6.5.8-2/) we no longer see the bug.
Note that it is not possible to take a much newer version like 4.8.0, since in that case the server throws a version mismatch error.
I have a requirement to take a WHERE condition passed by the user as a program argument. Based on that condition I need to query the source database.
I am using spark-sql 2.3.1.
How do I construct and execute a dynamically built query?
Sample query:
select ProductId, COUNT(*) AS ProductSaleCount
from productsale
where to_date(Date) >= "2015-12-17"
and to_date(Date) <= "2015-12-31"
group by ProductId
All you have to do in your scenario is build a query string, which would go something like:
val query = "select ProductId, COUNT(*) AS ProductSaleCount from productsale where to_date(Date) >= '" + fromDate + "' and to_date(Date) <= '" + toDate + "' group by ProductId"
You would get fromDate and toDate from your program arguments. Note the single quotes around the date literals so they end up quoted in the SQL.
How to execute this, however, is a different issue, and it depends on your database.
For Hive you can simply build your SparkSession with enableHiveSupport:
val spark = SparkSession.builder().appName("My App").enableHiveSupport().config("spark.sql.warehouse.dir", warehouseLocation).getOrCreate()
val data = spark.sqlContext.sql(query)
If the data is in a DataFrame and you want to query that, you have to create a temporary view and then run your query against it:
finalDataFrame.createOrReplaceTempView("productsale")
val data = spark.sqlContext.sql(query)
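Putting the pieces together, a minimal sketch assuming the two dates arrive as the first two program arguments and the source data is already loaded into a DataFrame named productSaleDF (both names are placeholders):
// fromDate and toDate come in as program arguments, e.g. "2015-12-17" and "2015-12-31"
val Array(fromDate, toDate) = args.take(2)
val query =
  s"""select ProductId, COUNT(*) AS ProductSaleCount
     |from productsale
     |where to_date(Date) >= '$fromDate' and to_date(Date) <= '$toDate'
     |group by ProductId""".stripMargin
// expose the DataFrame under the table name used in the query, then run it
productSaleDF.createOrReplaceTempView("productsale")
val data = spark.sql(query)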
Hope this helps
I'm using Google's official Spark-BigQuery connector (com.google.cloud.bigdataoss:bigquery-connector:hadoop2-0.13.6) to retrieve data from BigQuery on a huge time-partitioned table (field myDateField).
So I'm currently doing this (example adapted from the docs) to retrieve recent data (less than a month old):
val config = sparkSession.sparkContext.hadoopConfiguration
config.set(BigQueryConfiguration.GCS_BUCKET_KEY, "mybucket")
val fullyQualifiedInputTableId = "project:dataset.table"
BigQueryConfiguration.configureBigQueryInput(config, fullyQualifiedInputTableId)
val bigQueryRDD: RDD[(LongWritable, JsonObject)] = sparkSession.sparkContext.newAPIHadoopRDD(
  config,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject]
)
val convertedRDD: RDD[MyClass] = bigQueryRDD.map { case (_, jsonObject) =>
  convertJsonObjectToMyClass(jsonObject)
}
val recentData: RDD[MyClass] = convertedRDD.filter { case MyClass(_, myDateField) =>
  myDateField >= "2018-08-10"
}
println(recentData.count())
Questions
I'm wondering if the connector queries all data from the BigQuery table, like :
SELECT *
FROM `project.dataset.table`
Or whether it does something cleverer (and, more importantly, less expensive) that uses the partitioning, like:
SELECT *
FROM `project.dataset.table`
WHERE myDateField >= TIMESTAMP("2018-08-10")
Moreover, in general, how can I control the cost of a query and make sure that irrelevant data (here, for example, data before "2018-08-10") is not retrieved for nothing?
In case the connector retrieves all the data, can I provide a specific query? BigQueryConfiguration.INPUT_QUERY_KEY (mapred.bq.input.query) is deprecated, but I don't see any replacement, and the docs are not very clear on that point.
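One hedged workaround, not something the connector docs prescribe: if you can add the google-cloud-bigquery client library, materialize the partition-filtered query into a staging table yourself and point the connector at that much smaller table, so the full table is never exported. The table name staging_recent below is a made-up placeholder.
import com.google.cloud.bigquery.{BigQueryOptions, QueryJobConfiguration, TableId}

val bigquery = BigQueryOptions.getDefaultInstance.getService
// Partition-pruned query, written into a hypothetical staging table.
val jobConfig = QueryJobConfiguration.newBuilder(
    """SELECT * FROM `project.dataset.table` WHERE myDateField >= TIMESTAMP("2018-08-10")""")
  .setDestinationTable(TableId.of("project", "dataset", "staging_recent"))
  .setUseLegacySql(false)
  .build()
bigquery.query(jobConfig) // runs the job and waits for completion

// then read the (much smaller) staging table with the connector exactly as above
BigQueryConfiguration.configureBigQueryInput(config, "project:dataset.staging_recent")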
So as far as I know, Apache Spark doesn't have functionality that imitates the SQL UPDATE command, i.e. changing a single value in a column given a certain condition. The only way around that is to use the following approach I was instructed to use (here on Stack Overflow): withColumn(columnName, where('condition', value));
However, the condition has to be of Column type, meaning I have to use the built-in column filtering functions Spark has (equalTo, isin, lt, gt, etc.). Is there a way I can use an SQL statement instead of those built-in functions?
The problem is that I'm given a text file with SQL conditions, like WHERE ID > 5 or WHERE AGE != 50, etc. Then I have to label values based on those conditions, and I thought of following the withColumn() approach, but I can't plug an SQL statement into that function. Any idea how I can get around this?
I found a way to get around this:
Split your dataset into two sets: the values you want to update and the values you don't want to update.
Dataset<Row> valuesToUpdate = dataset.filter("conditionToFilterValues");
Dataset<Row> valuesNotToUpdate = dataset.except(valuesToUpdate);
valuesToUpdate = valuesToUpdate.withColumn("updatedColumn", functions.lit("updateValue"));
Dataset<Row> updatedDataset = valuesNotToUpdate.union(valuesToUpdate);
This, however, doesn't keep the original order of the records, so if order matters to you, this approach won't suffice.
In PySpark you have to use .subtract instead of .except
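Note that Dataset.filter and org.apache.spark.sql.functions.expr also accept SQL expression strings directly, so a condition read from the text file can be plugged in without translating it into equalTo/isin/lt/gt calls. A minimal sketch (the condition, column name and label values are placeholders):
import org.apache.spark.sql.functions.{expr, lit, when}

// "ID > 5" stands in for a condition read from the text file; "label" and its values are made up
val condition = expr("ID > 5")
val labelled = dataset.withColumn("label", when(condition, lit("MATCHED")).otherwise(lit("NOT_MATCHED")))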
If you are using a DataFrame, you can register it as a temp table using df.registerTempTable("events").
Then you can query it like this (where yourWhereCondition is the condition string from the file, e.g. "WHERE ID > 5"):
sqlContext.sql("SELECT * FROM events " + yourWhereCondition)
The when clause translates into a CASE clause, which you can relate to the SQL CASE expression.
Example
scala> val condition_1 = when(col("col_1").isNull,"NA").otherwise("AVAILABLE")
condition_1: org.apache.spark.sql.Column = CASE WHEN (col_1 IS NULL) THEN NA ELSE AVAILABLE END
Or you can chain when clauses as well:
scala> val condition_2 = when(col("col_1") === col("col_2"),"EQUAL").when(col("col_1") > col("col_2"),"GREATER").
| otherwise("LESS")
condition_2: org.apache.spark.sql.Column = CASE WHEN (col_1 = col_2) THEN EQUAL WHEN (col_1 > col_2) THEN GREATER ELSE LESS END
scala> val new_df = df.withColumn("condition_1",condition_1).withColumn("condition_2",condition_2)
Still, if you want to use a table, you can register your DataFrame / Dataset as a temporary table and perform SQL queries:
df.createOrReplaceTempView("tempTable")//spark 2.1 +
df.registerTempTable("tempTable")//spark 1.6
Now you can perform SQL queries:
spark.sql("your query goes here with CASE clause and WHERE condition")//spark 2.1
sqlContext.sql("your query goes here with CASE clause and WHERE condition")//spark 1.6
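For example, labelling rows with a condition taken from the text file could look like this (the column, label values and condition are placeholders):
val labelled = spark.sql(
  "SELECT *, CASE WHEN ID > 5 THEN 'MATCHED' ELSE 'NOT_MATCHED' END AS label FROM tempTable")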
If you are using a Java Dataset, you can update it as below.
Here is the code:
Dataset ratesFinal1 = ratesFinal.filter(" on_behalf_of_comp_id != 'COMM_DERIVS' ");
ratesFinal1 = ratesFinal1.filter(" status != 'Hit/Lift' ");
Dataset ratesFinalSwap = ratesFinal1.filter (" on_behalf_of_comp_id in ('SAPPHIRE','BOND') and cash_derivative != 'cash'");
ratesFinalSwap = ratesFinalSwap.withColumn("ins_type_str",functions.lit("SWAP"));
Adding a new column with a value from an existing column:
ratesFinalSTW = ratesFinalSTW.withColumn("action", ratesFinalSTW.col("status"));
I use Spark 1.6.1. In my Spark Java program I connect to a Postgres database and register every table as a temporary table via JDBC. For example:
Map<String, String> optionsTable = new HashMap<String, String>();
optionsTable.put("url", "jdbc:postgresql://localhost/database?user=postgres&password=passwd");
optionsTable.put("dbtable", "table");
optionsTable.put("driver", "org.postgresql.Driver");
DataFrame table = sqlContext.read().format("jdbc").options(optionsTable).load();
table.registerTempTable("table");
This works without problems:
hiveContext.sql("select * from table").show();
Also this works:
DataFrame tmp = hiveContext.sql("select * from table where value=key");
tmp.registerTempTable("table");
And then I can see the contents of the table with:
hiveContext.sql("select * from table").show();
But now I have a problem. When I execute this:
hiveContext.sql("SELECT distinct id, timestamp FROM measure, measure_range w WHERE tble.timestamp >= w.left and tble.timestamp <= w.right").show();
Spark does nothing, but against the original Postgres database the query works fine. So I decided to modify the query a little bit, to this:
hiveContext.sql("SELECT id, timestamp FROM measure, measure_range w WHERE tble.timestamp >= w.left").show();
This query works and gives me results, but the other query does not. What is the difference, and why does the first query not work while the second one works fine?
And the database is not very big; for testing it has a size of 4 MB.
Since you're trying to select a distinct ID, you need to select timestamp as a part of an aggregate function and then group by ID. Otherwise, it doesn't know which time stamp to pair with the ID.
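A hedged rewrite along those lines, assuming one row per id with its latest matching timestamp is what you want (the alias m is added only for readability):
hiveContext.sql(
  "SELECT m.id, MAX(m.timestamp) AS timestamp " +
  "FROM measure m, measure_range w " +
  "WHERE m.timestamp >= w.left AND m.timestamp <= w.right " +
  "GROUP BY m.id").show()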