BETWEEN condition is not working on a Hive map column - Spark SQL - apache-spark

I have the following Hive table. The key column holds a map value (key-value pairs). I am executing a Spark SQL query with a BETWEEN condition on the key column, but it returns no records.
+---------------+--------------+----------------------+---------+
| column_value  | metric_name  | key                  | key[0]  |
+---------------+--------------+----------------------+---------+
| A37B          | Mean         | {0:"202009",1:"12"}  | 202009  |
| ACCOUNT_ID    | Mean         | {0:"202009",1:"12"}  | 202009  |
| ANB_200       | Mean         | {0:"202009",1:"12"}  | 202009  |
| ANB_201       | Mean         | {0:"202009",1:"12"}  | 202009  |
| AS82_RE       | Mean         | {0:"202009",1:"12"}  | 202009  |
| ATTR001       | Mean         | {0:"202009",1:"12"}  | 202009  |
| ATTR001_RE    | Mean         | {0:"202009",1:"12"}  | 202009  |
| ATTR002       | Mean         | {0:"202009",1:"12"}  | 202009  |
| ATTR002_RE    | Mean         | {0:"202009",1:"12"}  | 202009  |
| ATTR003       | Mean         | {0:"202009",1:"12"}  | 202009  |
| ATTR004       | Mean         | {0:"202009",1:"12"}  | 202009  |
| ATTR005       | Mean         | {0:"202009",1:"12"}  | 202009  |
| ATTR006       | Mean         | {0:"202009",1:"12"}  | 202008  |
+---------------+--------------+----------------------+---------+
I am running the Spark SQL query below:
SELECT column_value, metric_name,key FROM table where metric_name = 'Mean' and column_value IN ('ATTR003','ATTR004','ATTR005') and key[0] between 202009 and 202003
The query does not return any records. If I use IN (202009,202007,202008,202006,202005,202004,202003) instead of the BETWEEN condition, it returns results.
Need help!

Try the BETWEEN values the other way around, e.g. between 202003 and 202009. BETWEEN expects the lower bound first, so between 202009 and 202003 matches nothing.
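For reference, a minimal PySpark sketch of the corrected query (your_table is a stand-in for the question's table name; the only change is the order of the BETWEEN bounds):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# BETWEEN 202003 AND 202009 means key[0] >= 202003 AND key[0] <= 202009
result = spark.sql("""
    SELECT column_value, metric_name, key
    FROM your_table
    WHERE metric_name = 'Mean'
      AND column_value IN ('ATTR003', 'ATTR004', 'ATTR005')
      AND key[0] BETWEEN 202003 AND 202009
""")
result.show()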

Related

How to get all combinations from Sphinx search?

I search via Sphinx SQL as follows:
SELECT * FROM sphinx.articles WHERE query='something everything;mode=all';
+----------+--------+-------------------------------+
| id       | weight | query                         |
+----------+--------+-------------------------------+
| 2324266  | 2      | something everything;mode=all |
| 6997338  | 2      | something everything;mode=all |
| 12002597 | 2      | something everything;mode=all |
| 12543040 | 2      | something everything;mode=all |
| 16314547 | 2      | something everything;mode=all |
| 19094425 | 2      | something everything;mode=all |
| 21398510 | 2      | something everything;mode=all |
| 23020445 | 2      | something everything;mode=all |
| 23040584 | 2      | something everything;mode=all |
| 24059424 | 2      | something everything;mode=all |
| 26009287 | 2      | something everything;mode=all |
| 27476187 | 2      | something everything;mode=all |
| 30488694 | 2      | something everything;mode=all |
| 30698992 | 2      | something everything;mode=all |
| 33191618 | 2      | something everything;mode=all |
| 33900227 | 2      | something everything;mode=all |
| 35671048 | 2      | something everything;mode=all |
| 39324937 | 2      | something everything;mode=all |
| 40373341 | 2      | something everything;mode=all |
| 40391221 | 2      | something everything;mode=all |
+----------+--------+-------------------------------+
20 rows in set (0.233 sec)
and get the total number of results with:
SHOW STATUS LIKE 'Sphinx_total_found';
+--------------------+--------+
| Variable_name      | Value  |
+--------------------+--------+
| Sphinx_total_found | 356179 |
+--------------------+--------+
1 row in set (0.004 sec)
I wonder whether it is possible to get all possible keyword combinations indexed by Sphinx.
For example, to get the number of results for all combinations of two keywords, like this:
+--------------------------------+---------------+
| query                          | total_results |
+--------------------------------+---------------+
| something everything;mode=all  | 58844         |
| word1 word2;mode=all           | 11            |
| word1 word3;mode=all           | 234           |
| word2 word3;mode=all           | 663           |
| word2 word4;mode=all           | 9115          |
+--------------------------------+---------------+
I understand that Sphinx finds the results dynamically for the keywords given in the query, but in theory all indexed keywords are known, so we could enumerate the combinations ourselves. However, it is too slow to run the queries individually.
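No built-in way is shown in this thread. As a rough illustration of the brute-force approach described above (one query per keyword pair), here is a minimal Python sketch. The keyword list, connection settings, and the pymysql client are placeholders/assumptions; the queries themselves follow the form used in the question.

from itertools import combinations
import pymysql

# Placeholder keyword list and connection settings -- adjust to your setup.
keywords = ["something", "everything", "word1", "word2", "word3", "word4"]
conn = pymysql.connect(host="localhost", port=3306, user="root", db="sphinx")

counts = {}
with conn.cursor() as cur:
    for w1, w2 in combinations(keywords, 2):
        # Same query form as in the question, one keyword pair at a time.
        cur.execute("SELECT id FROM sphinx.articles WHERE query=%s",
                    (f"{w1} {w2};mode=all",))
        cur.fetchall()
        cur.execute("SHOW STATUS LIKE 'Sphinx_total_found'")
        counts[(w1, w2)] = int(cur.fetchone()[1])

for pair, total in counts.items():
    print(pair, total)

This is exactly the per-pair querying the question calls too slow; it only shows the mechanics, not a faster approach.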

Error while querying a Hive table with a map datatype in Spark SQL, but it works when executed in HiveQL

I have a Hive table with the structure below:
+---------------+--------------+----------------------+
| column_value  | metric_name  | key                  |
+---------------+--------------+----------------------+
| A37B          | Mean         | {0:"202006",1:"1"}   |
| ACCOUNT_ID    | Mean         | {0:"202006",1:"2"}   |
| ANB_200       | Mean         | {0:"202006",1:"3"}   |
| ANB_201       | Mean         | {0:"202006",1:"4"}   |
| AS82_RE       | Mean         | {0:"202006",1:"5"}   |
| ATTR001       | Mean         | {0:"202007",1:"2"}   |
| ATTR001_RE    | Mean         | {0:"202007",1:"3"}   |
| ATTR002       | Mean         | {0:"202007",1:"4"}   |
| ATTR002_RE    | Mean         | {0:"202007",1:"5"}   |
| ATTR003       | Mean         | {0:"202008",1:"3"}   |
| ATTR004       | Mean         | {0:"202008",1:"4"}   |
| ATTR005       | Mean         | {0:"202008",1:"5"}   |
| ATTR006       | Mean         | {0:"202009",1:"4"}   |
| ATTR006       | Mean         | {0:"202009",1:"5"}   |
+---------------+--------------+----------------------+
I need to write a Spark SQL query that filters on the key column with a NOT IN condition using a combination of both map keys.
The following query works fine in HiveQL via Beeline:
select * from your_data where key[0] between '202006' and '202009' and key NOT IN ( map(0,"202009",1,"5") );
But when I try the same query in Spark SQL, I get this error:
cannot resolve due to data type mismatch: map<int,string>
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:115)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
Please help!
I got the answer from a different question I raised before. This query works fine:
select * from your_data where key[0] between 202006 and 202009 and NOT (key[0]="202009" and key[1]="5" );
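For completeness, a minimal PySpark sketch that runs the same working query through Spark SQL (your_data is the placeholder table name already used above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Express the NOT IN on the map value as a negated condition on the individual keys.
df = spark.sql("""
    SELECT *
    FROM your_data
    WHERE key[0] BETWEEN 202006 AND 202009
      AND NOT (key[0] = '202009' AND key[1] = '5')
""")
df.show()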

Remove groups from pandas where {condition}

I have a dataframe like this:
+---+--------------------------------------+-----------+
|   | envelopeid                           | message   |
+---+--------------------------------------+-----------+
| 1 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00002 |
| 2 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.00004 |
| 3 | d55edb65-dc77-41d0-bb53-43cf01376a04 | CMN.11001 |
| 4 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00002 |
| 5 | 5cb72b9c-adb8-4e1c-9296-db2080cb3b6d | CMN.00001 |
| 6 | f4260b99-6579-4607-bfae-f601cc13ff0c | CMN.00202 |
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 8 | fee98470-aa8f-4ec5-8bcd-1683f85727c2 | TKP.00001 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I've grouped it with grouped = df.groupby('envelopeid')
I need to remove all groups from the dataframe and keep only the groups whose messages are (CMN.00002) or (CMN.00002 and CMN.00004) only.
Desired dataframe:
+---+--------------------------------------+-----------+
|   | envelopeid                           | message   |
+---+--------------------------------------+-----------+
| 7 | 8f673ae3-0293-4aca-ad6b-572f138515e6 | CMN.00002 |
| 9 | 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00002 |
| 10| 88926399-3697-4e15-8d25-6cb37a1d250e | CMN.00004 |
+---+--------------------------------------+-----------+
I tried:
(grouped.message.transform(lambda x: x.eq('CMN.00001').any() or (x.eq('CMN.00002').any() and x.ne('CMN.00002' or 'CMN.00004').any()) or x.ne('CMN.00002').all()))
but it does not work properly.
Try:
grouped = df.loc[df['message'].isin(['CMN.00002', 'CMN.00002', 'CMN.00004'])].groupby('envelopeid')
Try this: df[df.message == 'CMN.00002']
outdf = df.groupby('envelopeid').filter(lambda x: tuple(x.message)== ('CMN.00002',) or tuple(x.message)== ('CMN.00002','CMN.00004'))
So I figured it out.
The resulting dataframe gets only the groups that have only the CMN.00002 message, or CMN.00002 and CMN.00004, which is what I need.
I used filter instead of transform.
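A minimal runnable sketch of that filter approach, using the sample data from the question:

import pandas as pd

df = pd.DataFrame({
    'envelopeid': ['d55edb65-dc77-41d0-bb53-43cf01376a04'] * 3
                  + ['5cb72b9c-adb8-4e1c-9296-db2080cb3b6d'] * 2
                  + ['f4260b99-6579-4607-bfae-f601cc13ff0c',
                     '8f673ae3-0293-4aca-ad6b-572f138515e6',
                     'fee98470-aa8f-4ec5-8bcd-1683f85727c2']
                  + ['88926399-3697-4e15-8d25-6cb37a1d250e'] * 2,
    'message': ['CMN.00002', 'CMN.00004', 'CMN.11001',
                'CMN.00002', 'CMN.00001',
                'CMN.00202', 'CMN.00002', 'TKP.00001',
                'CMN.00002', 'CMN.00004'],
})

# Keep only groups whose messages are exactly (CMN.00002,) or (CMN.00002, CMN.00004).
outdf = df.groupby('envelopeid').filter(
    lambda x: tuple(x.message) == ('CMN.00002',)
    or tuple(x.message) == ('CMN.00002', 'CMN.00004'))
print(outdf)  # rows 7, 9 and 10 from the question's dataframe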

Pandas sort not sorting data properly

I am trying to sort the results of sklearn.ensemble.RandomForestRegressor's feature_importances_
I have the following function:
import pandas as pd

def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    importances.sort_values(by='Gini-importance')
    return importances
I use it like so:
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
And I get the following results:
| PART | 0.035034 |
| MONTH1 | 0.02507 |
| YEAR1 | 0.020075 |
| MONTH2 | 0.02321 |
| YEAR2 | 0.017861 |
| MONTH3 | 0.042606 |
| YEAR3 | 0.028508 |
| DAYS | 0.047603 |
| MEDIANDIFF | 0.037696 |
| F2 | 0.008783 |
| F1 | 0.015764 |
| F6 | 0.017933 |
| F4 | 0.017511 |
| F5 | 0.017799 |
| SS22 | 0.010521 |
| SS21 | 0.003896 |
| SS19 | 0.003894 |
| SS23 | 0.005249 |
| SS20 | 0.005127 |
| RR | 0.021626 |
| HI_HOURS | 0.067584 |
| OI_HOURS | 0.054369 |
| MI_HOURS | 0.062121 |
| PERFORMANCE_FACTOR | 0.033572 |
| PERFORMANCE_INDEX | 0.073884 |
| NUMPA | 0.022445 |
| BUMPA | 0.024192 |
| ELOH | 0.04386 |
| FFX1 | 0.128367 |
| FFX2 | 0.083839 |
I thought the line importances.sort_values(by='Gini-importance') would sort them, but it does not. Why is this not working correctly?
importances.sort_values(by='Gini-importance') returns the sorted dataframe, and your function discards that return value.
You want return importances.sort_values(by='Gini-importance').
Or you could make the sort in place:
importances.sort_values(by='Gini-importance', inplace=True)
return importances
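Putting the answer together, the corrected function would look like this:

import pandas as pd

def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    # sort_values returns a new, sorted DataFrame; return it instead of the unsorted one.
    return importances.sort_values(by='Gini-importance')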

SSIS: convert columns to rows from an Excel sheet

I have an Excel sheet with a table structured like this:
+------------+-----+----------+----------+------------------+------------------+------------------+------------------+------------------+------------------+----------------+----------------+------------------+------------------+------------------+------------------+------------------+------------------+----------------+----------------+
| date | Day | StoreDdg | StoreR/H | DbgCategory1Dpt1 | R/HCategory1Dpt1 | DbgCategory2Dpt1 | R/HCategory2Dpt1 | DbgCategory3Dpt1 | R/HCategory2Dpt1 | DbgDepartment1 | R/HDepartment1 | DbgCategory1Dpt2 | R/HCategory1Dpt2 | DbgCategory2Dpt2 | R/HCategory2Dpt2 | DbgCategory3Dpt2 | R/HCategory2Dpt2 | DbgDepartment2 | R/HDepartment2 |
+------------+-----+----------+----------+------------------+------------------+------------------+------------------+------------------+------------------+----------------+----------------+------------------+------------------+------------------+------------------+------------------+------------------+----------------+----------------+
| 1-Jan-2017 | Sun | 138,894 | 133% | 500 | 44% | 12,420 | 146% | | | | 11,920 | 104% | #DIV/0! | 13,580 | 113% | 9,250 | 92% | 6,530 | 147% |
| 2-Jan-2017 | Mon | 138,894 | 270% | 500 | 136% | 12,420 | 277% | 11,920 | | | | 193% | #DIV/0! | 13,580 | 299% | 9,250 | 225% | 6,530 | 181% |
+------------+-----+----------+----------+------------------+------------------+------------------+------------------+------------------+------------------+----------------+----------------+------------------+------------------+------------------+------------------+------------------+------------------+----------------+----------------+
I would like to convert this into
+------------+-----+--------+-------------+---------------+---------+------+
| date       | Day | Store  | Department  | Category      | Dpt     | R/H  |
+------------+-----+--------+-------------+---------------+---------+------+
| 1-Jan-2017 | Sun | Store1 | Department1 | Category1Dpt1 | 138,894 | 133% |
| 1-Jan-2017 | Sun | Store1 | Department1 | Category2Dpt1 | 500 | 44% |
| 1-Jan-2017 | Sun | Store1 | Department1 | Category3Dpt1 | 12,420 | 146% |
| 1-Jan-2017 | Sun | Store1 | Department2 | Category1Dpt2 | 11,920 | 104% |
| 1-Jan-2017 | Sun | Store1 | Department2 | Category2Dpt2 | 13,580 | 44% |
| 1-Jan-2017 | Sun | Store1 | Department2 | Category3Dpt2 | 9,250 | 92% |
| 2-Jan-2017 | Mon | Store1 | Department1 | Category1Dpt1 | 138,894 | 270% |
| 2-Jan-2017 | Mon | Store1 | Department1 | Category2Dpt1 | 500 | 136% |
| 2-Jan-2017 | Mon | Store1 | Department1 | Category3Dpt1 | 12,420 | 277% |
| 2-Jan-2017 | Mon | Store1 | Department2 | Category1Dpt2 | 13,580 | 299% |
| 2-Jan-2017 | Mon | Store1 | Department2 | Category2Dpt2 | 9,250 | 225% |
| 2-Jan-2017 | Mon | Store1 | Department2 | Category3Dpt2 | 6,530 | 181% |
+------------+-----+--------+-------------+---------------+---------+------+
Any recommendations on how to do this?
You can do this by using the Excel file as the source. You might have to save the Excel file in the 2005 or 2007 format, depending on the version of Visual Studio you are using; if it is already in 2007 format, that's fine.
To extract the data for DbgDepartment1 and DbgDepartment2, you can create two different sources in the Data Flow Task. In one, select the columns related to DbgDepartment1; in the second, choose those for DbgDepartment2. You might have to use a Derived Column transformation, depending on the logic you need afterwards. Then use a Union transformation (the source file is the same) and load the data into the destination. Try it and you will get a solution.
I solved this with the R statistical language, using data-tidying packages ("tidyr", "devtools").
For more info, check the link: http://garrettgman.github.io/tidying/
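For comparison, a rough pandas equivalent of that tidyr reshape (a sketch only: it assumes the column names follow the Dbg<name> / R/H<name> pattern shown in the question, and 'stores.xlsx' is a placeholder file name):

import pandas as pd

df = pd.read_excel('stores.xlsx')  # placeholder file name

# Keep the identifying columns plus every Dbg*/R/H* pair, then unpivot them.
value_cols = [c for c in df.columns if c.startswith(('Dbg', 'R/H'))]
long_df = pd.wide_to_long(
    df[['date', 'Day'] + value_cols],
    stubnames=['Dbg', 'R/H'],
    i=['date', 'Day'],
    j='Category',
    sep='',
    suffix=r'.+',
).reset_index()

# long_df has one row per date/Category with Dbg and R/H value columns;
# the Department column can then be derived from the Category suffix (e.g. '...Dpt1').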
