qf in Solr search query not working as expected
In my schema.xml, entry for all fields is as follows:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="producturl" type="string" indexed="false" stored="true" required="true"/>
<field name="imageurl" type="string" indexed="false" stored="true" />
<field name="category" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="categorypath" type="lowercase" indexed="true" stored="true" />
<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true" />
<field name="inStock" type="boolean" indexed="true" stored="true" />
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="store" type="text_general" indexed="true" stored="true" omitNorms="true"/>
I want results where the search string occurs in category to rank above results where it occurs only in name. I tried the following query, but I always get results where the query string is present in name first:
http:///solr//select?debugQuery=on&indent=on&q=query:"{!edismax mm='100%25' qf='manu^10.0' v=$q1}" OR query:"{!edismax mm='100%25' qf='name^0.2' v=$q1}" OR query:"{!edismax mm='100%25' qf='category^10.0' v=$q1}" OR query:"{!edismax mm='100%25' qf='categorypath^5.0' v=$q1}"&q1=mobile&wt=csv
A result snippet for the above query is as follows:
mrp_d,manu,description,discount_i,store,_version_,price,categorypath,imageurl,name,id,category,producturl,last_modified
60.68,Not Specified,,50,Lazada,1549123070131699713,30.34,Women|Wallets & Accessories,http://srv-live.lazada.com.my/p/image-78714241-16244bcc8b6488fa7c5dc3479b6b6f69-product.jpg,Fashion Women Lady Cute Cat Clutch Long Purse Wallet Card Holder Mobile Bag Green,LZDNO037FAAA8H917ANMY-18042381,Wallets & Accessories,http://ho.lazada.com.my/SHHSoo?url=http%3A%2F%2Fwww.lazada.com.my%2Ffashion-women-lady-cute-cat-clutch-long-purse-wallet-card-holder-mobile-bag-green-14241787.html%3Foffer_id%3D%7Boffer_id%7D%26affiliate_id%3D%7Baffiliate_id%7D%26offer_name%3D%7Boffer_name%7D_%7Boffer_file_id%7D%26affiliate_name%3D%7Baffiliate_name%7D%26transaction_id%3D%7Btransaction_id%7D&aff_sub=&aff_sub2=&aff_sub3=&aff_sub4=&aff_sub5=,2016-10-25T01:27:53Z
60.3,Not Specified,,50,Lazada,1549123070201954306,30.15,Women|Wallets & Accessories,http://srv-live.lazada.com.my/p/image-53144241-27196219e745476ccd5b10cb0febf37b-product.jpg,Fashion Women Flower Long Zip Wallet Card Holder Mobile Bag Clutch Purse Pockets Navy Blue,LZDNO037FAAA8HAUFANMY-18045720,Wallets & Accessories,http://ho.lazada.com.my/SHHSoo?url=http%3A%2F%2Fwww.lazada.com.my%2Ffashion-women-flower-long-zip-wallet-card-holder-mobile-bag-clutch-purse-pockets-navy-blue-14244135.html%3Foffer_id%3D%7Boffer_id%7D%26affiliate_id%3D%7Baffiliate_id%7D%26offer_name%3D%7Boffer_name%7D_%7Boffer_file_id%7D%26affiliate_name%3D%7Baffiliate_name%7D%26transaction_id%3D%7Btransaction_id%7D&aff_sub=&aff_sub2=&aff_sub3=&aff_sub4=&aff_sub5=,2016-10-25T01:27:53Z
84.66,Not Specified,,66,Lazada,1549123070268014594,28.22,Women|Bags & Clutches|Handbags,http://srv-live.lazada.com.my/p/image-63215241-13718b9bf73f9ba09610037ef9ac8dad-product.jpg,ERA New Women Vintage Leather Coin Cell Phone Mobile Mini Cross-body Shoulder Bag,LZDNO037FAAA8HGBOANMY-18055171,Handbags,http://ho.lazada.com.my/SHHSoo?url=http%3A%2F%2Fwww.lazada.com.my%2Fera-new-women-vintage-leather-coin-cell-phone-mobile-mini-cross-body-shoulder-bag-14251236.html%3Foffer_id%3D%7Boffer_id%7D%26affiliate_id%3D%7Baffiliate_id%7D%26offer_name%3D%7Boffer_name%7D_%7Boffer_file_id%7D%26affiliate_name%3D%7Baffiliate_name%7D%26transaction_id%3D%7Btransaction_id%7D&aff_sub=&aff_sub2=&aff_sub3=&aff_sub4=&aff_sub5=,2016-10-25T01:27:53Z
84.66,Not Specified,,66,Lazada,1549123070268014595,28.22,Women|Bags & Clutches|Handbags,http://srv-live.lazada.com.my/p/image-93215241-08f2151a427d52295328a5d7e5addce1-product.jpg,ERA New Women Vintage Leather Coin Cell Phone Mobile Mini Cross-body Shoulder Bag,LZDNO037FAAA8HGBRANMY-18055174,Handbags,http://ho.lazada.com.my/SHHSoo?url=http%3A%2F%2Fwww.lazada.com.my%2Fera-new-women-vintage-leather-coin-cell-phone-mobile-mini-cross-body-shoulder-bag-14251239.html%3Foffer_id%3D%7Boffer_id%7D%26affiliate_id%3D%7Baffiliate_id%7D%26offer_name%3D%7Boffer_name%7D_%7Boffer_file_id%7D%26affiliate_name%3D%7Baffiliate_name%7D%26transaction_id%3D%7Btransaction_id%7D&aff_sub=&aff_sub2=&aff_sub3=&aff_sub4=&aff_sub5=,2016-10-25T01:27:53Z
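For reference, the nested-query request can be assembled programmatically so the local-params syntax and percent-encoding stay correct. This is a sketch only: the host and core below are placeholders, and it uses the `_query_:` magic-field form that the later queries in this post use:

```python
from urllib.parse import urlencode

# Hypothetical host/core; substitute your own Solr URL.
base = "http://localhost:8983/solr/mycore/select"

# One edismax subquery per qf field, all referencing the shared $q1 parameter.
subquery = '_query_:"{{!edismax mm=\'100%\' qf=\'{field}\' v=$q1}}"'
fields = ["manu^10.0", "name^0.2", "category^10.0", "categorypath^5.0"]
q = " OR ".join(subquery.format(field=f) for f in fields)

params = {"q": q, "q1": "mobile", "wt": "csv", "debugQuery": "on", "indent": "on"}
url = base + "?" + urlencode(params)
print(url)
```

`urlencode` takes care of escaping `%`, quotes, and spaces, which is easy to get wrong when the URL is built by hand.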
What am I doing wrong here? How can I fine tune the results?
Thanks in advance for the help.
Edit (XML response with debug output):
Query:
http:///solr//select?debugQuery=on&defType=edismax&indent=on&mm=3&ps=10&q=mobile&qf=category^10.0%20manu^10.0%20categorypath^5.0%20name^0.2&wt=xml
Complete response:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">26</int>
<lst name="params">
<str name="mm">3</str>
<str name="q">mobile</str>
<str name="defType">edismax</str>
<str name="ps">10</str>
<str name="indent">on</str>
<str name="qf">category^10.0 manu^10.0 categorypath^5.0 name^0.2</str>
<str name="wt">xml</str>
<str name="debugQuery">on</str>
<str name="_">1477372296342</str>
</lst>
</lst>
<result name="response" numFound="12450223" start="0">
<doc>
<str name="description">This vertical case is made from PU leather material, and specially designed for your smartphone. The case features card slot and earphone hole for convenient use. It can be used as a shoulder bag or holster case. ? Made from PU material ? Come with two card slot ? Used as a shoulder bag and holster case ? Provides all-around protection</str>
<int name="discount_i">40</int>
<double name="mrp_d">990.0</double>
<str name="store">Flipkart</str>
<float name="price">490.0</float>
<str name="categorypath">Mobiles & Tablets|Mobile Accessories|Cases & Covers</str>
<str name="imageurl">http://img5a.flixcart.com/image/cases-covers/holster/h/c/x/jojo-jojo288086-400x400-imae822ffdyrhgmg.jpeg</str>
<str name="name">Jojo Holster for Fly Mobile MV 247</str>
<str name="id">ACCE9GQ3BMCHZXYY</str>
<str name="category">Cases & Covers</str>
<str name="manu">Jojo</str>
<str name="producturl">http://dl.flipkart.com/dl/jojo-holster-fly-mobile-mv-247/p/itme9gq44hz9dtjh?pid=ACCE9GQ3BMCHZXYY</str>
<date name="last_modified">2016-10-21T07:40:13Z</date>
<long name="_version_">1548784125563895810</long></doc>
<doc>
<str name="description">Mobile Cover Shop having numerous designs of exclusive range for mobile covers, back cover and protective cover and cases. Each cover is made from a special strong and durable plastic material, which provides your gadget with long lasting protection. We suggest you to keep this case in your collection for being in sync with latest trend in fashion.</str>
<int name="discount_i">50</int>
<double name="mrp_d">1200.0</double>
<str name="store">Flipkart</str>
<float name="price">599.0</float>
<str name="categorypath">Mobiles & Tablets|Mobile Accessories|Cases & Covers</str>
<str name="imageurl">http://img6a.flixcart.com/image/cases-covers/back-cover/p/k/e/mobile-cover-shop-mcs24354-400x400-imae8x3ydhwu6jrd.jpeg</str>
<str name="name">Mobile Cover Shop Back Cover for Lenovo K920</str>
<str name="id">ACCE8Y3U5HFZVPKE</str>
<str name="category">Cases & Covers</str>
<str name="manu">Mobile Cover Shop</str>
<str name="producturl">http://dl.flipkart.com/dl/mobile-cover-shop-back-lenovo-k920/p/itme8y3utmh9jzaa?pid=ACCE8Y3U5HFZVPKE</str>
<date name="last_modified">2016-10-21T07:40:13Z</date>
<long name="_version_">1548784125873225732</long></doc>
<doc>
<str name="description">Mobile Cover Shop having numerous designs of exclusive range for mobile covers, back cover and protective cover and cases. Each cover is made from a special strong and durable plastic material, which provides your gadget with long lasting protection. We suggest you to keep this case in your collection for being in sync with latest trend in fashion.</str>
<int name="discount_i">50</int>
<double name="mrp_d">1200.0</double>
<str name="store">Flipkart</str>
<float name="price">599.0</float>
<str name="categorypath">Mobiles & Tablets|Mobile Accessories|Cases & Covers</str>
<str name="imageurl">http://img6a.flixcart.com/image/cases-covers/back-cover/d/s/u/mobile-cover-shop-mcs23176-400x400-imae8p53hgwggnwb.jpeg</str>
<str name="name">Mobile Cover Shop Back Cover for Samsung Galaxy S5</str>
<str name="id">ACCE8Y3TJPXAFDSU</str>
<str name="category">Cases & Covers</str>
<str name="manu">Mobile Cover Shop</str>
<str name="producturl">http://dl.flipkart.com/dl/mobile-cover-shop-back-samsung-galaxy-s5/p/itme8y3tdjn4nqtj?pid=ACCE8Y3TJPXAFDSU</str>
<date name="last_modified">2016-10-21T07:40:14Z</date>
<long name="_version_">1548784126891393025</long></doc>
<doc>
<str name="description">Mobile Cover Shop having numerous designs of exclusive range for mobile covers, back cover and protective cover and cases. Each cover is made from a special strong and durable plastic material, which provides your gadget with long lasting protection. We suggest you to keep this case in your collection for being in sync with latest trend in fashion.</str>
<int name="discount_i">50</int>
<double name="mrp_d">1200.0</double>
<str name="store">Flipkart</str>
<float name="price">599.0</float>
<str name="categorypath">Mobiles & Tablets|Mobile Accessories|Cases & Covers</str>
<str name="imageurl">http://img5a.flixcart.com/image/cases-covers/back-cover/z/g/e/mobile-cover-shop-mcs23352-400x400-imae8p544uhatwtc.jpeg</str>
<str name="name">Mobile Cover Shop Back Cover for Samsung Galaxy Note 2 N7100</str>
<str name="id">ACCE8Y3TGHAK2ZGE</str>
<str name="category">Cases & Covers</str>
<str name="manu">Mobile Cover Shop</str>
<str name="producturl">http://dl.flipkart.com/dl/mobile-cover-shop-back-samsung-galaxy-note-2-n7100/p/itme8y3tcugyxmeb?pid=ACCE8Y3TGHAK2ZGE</str>
<date name="last_modified">2016-10-21T07:40:15Z</date>
<long name="_version_">1548784128432799747</long></doc>
<doc>
<str name="description">Mobile Cover Shop having numerous designs of exclusive range for mobile covers, back cover and protective cover and cases. Each cover is made from a special strong and durable plastic material, which provides your gadget with long lasting protection. We suggest you to keep this case in your collection for being in sync with latest trend in fashion.</str>
<int name="discount_i">50</int>
<double name="mrp_d">1200.0</double>
<str name="store">Flipkart</str>
<float name="price">599.0</float>
<str name="categorypath">Mobiles & Tablets|Mobile Accessories|Cases & Covers</str>
<str name="imageurl">http://img6a.flixcart.com/image/cases-covers/back-cover/x/3/q/mobile-cover-shop-mcs22843-400x400-imae8p53feah3b2q.jpeg</str>
<str name="name">Mobile Cover Shop Back Cover for Apple iPhone 6 Plus</str>
<str name="id">ACCE8Y3TM8YWHX3Q</str>
<str name="category">Cases & Covers</str>
<str name="manu">Mobile Cover Shop</str>
<str name="producturl">http://dl.flipkart.com/dl/mobile-cover-shop-back-apple-iphone-6-plus/p/itme8y3twsyngygp?pid=ACCE8Y3TM8YWHX3Q</str>
<date name="last_modified">2016-10-21T07:40:15Z</date>
<long name="_version_">1548784128862715904</long></doc>
</result>
<lst name="debug">
<str name="rawquerystring">mobile</str>
<str name="querystring">mobile</str>
<str name="parsedquery">(+DisjunctionMaxQuery(((manu:mobile)^10.0 | (name:mobile)^0.2 | (category:mobile)^10.0 | (categorypath:mobile)^5.0)))/no_coord</str>
<str name="parsedquery_toString">+((manu:mobile)^10.0 | (name:mobile)^0.2 | (category:mobile)^10.0 | (categorypath:mobile)^5.0)</str>
<lst name="explain">
<str name="ACCE9GQ3BMCHZXYY">
0.9692176 = max of:
0.9692176 = weight(name:mobile in 311) [], result of:
0.9692176 = score(doc=311,freq=1.0 = termFreq=1.0
), product of:
0.2 = boost
4.5229306 = idf(docFreq=177649, docCount=16362429)
1.0714487 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
8.496009 = avgFieldLength
7.111111 = fieldLength
</str>
<str name="ACCE8Y3U5HFZVPKE">
62.168762 = max of:
62.168762 = weight(manu:mobile in 372) [], result of:
62.168762 = score(doc=372,freq=1.0 = termFreq=1.0
), product of:
10.0 = boost
6.216876 = idf(docFreq=32829, docCount=16452022)
1.0 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.0 = parameter b (norms omitted for field)
0.8345084 = weight(name:mobile in 372) [], result of:
0.8345084 = score(doc=372,freq=1.0 = termFreq=1.0
), product of:
0.2 = boost
4.5229306 = idf(docFreq=177649, docCount=16362429)
0.92253065 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
8.496009 = avgFieldLength
10.24 = fieldLength
</str>
<str name="ACCE8Y3TJPXAFDSU">
62.168762 = max of:
62.168762 = weight(manu:mobile in 570) [], result of:
62.168762 = score(doc=570,freq=1.0 = termFreq=1.0
), product of:
10.0 = boost
6.216876 = idf(docFreq=32829, docCount=16452022)
1.0 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.0 = parameter b (norms omitted for field)
0.8345084 = weight(name:mobile in 570) [], result of:
0.8345084 = score(doc=570,freq=1.0 = termFreq=1.0
), product of:
0.2 = boost
4.5229306 = idf(docFreq=177649, docCount=16362429)
0.92253065 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
8.496009 = avgFieldLength
10.24 = fieldLength
</str>
<str name="ACCE8Y3TGHAK2ZGE">
62.168762 = max of:
62.168762 = weight(manu:mobile in 819) [], result of:
62.168762 = score(doc=819,freq=1.0 = termFreq=1.0
), product of:
10.0 = boost
6.216876 = idf(docFreq=32829, docCount=16452022)
1.0 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.0 = parameter b (norms omitted for field)
0.6644898 = weight(name:mobile in 819) [], result of:
0.6644898 = score(doc=819,freq=1.0 = termFreq=1.0
), product of:
0.2 = boost
4.5229306 = idf(docFreq=177649, docCount=16362429)
0.7345788 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
8.496009 = avgFieldLength
16.0 = fieldLength
</str>
<str name="ACCE8Y3TM8YWHX3Q">
62.168762 = max of:
62.168762 = weight(manu:mobile in 886) [], result of:
62.168762 = score(doc=886,freq=1.0 = termFreq=1.0
), product of:
10.0 = boost
6.216876 = idf(docFreq=32829, docCount=16452022)
1.0 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.0 = parameter b (norms omitted for field)
0.8345084 = weight(name:mobile in 886) [], result of:
0.8345084 = score(doc=886,freq=1.0 = termFreq=1.0
), product of:
0.2 = boost
4.5229306 = idf(docFreq=177649, docCount=16362429)
0.92253065 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
8.496009 = avgFieldLength
10.24 = fieldLength
</str>
</lst>
<str name="QParser">ExtendedDismaxQParser</str>
<null name="altquerystring"/>
<null name="boost_queries"/>
<arr name="parsed_boost_queries"/>
<null name="boostfuncs"/>
<lst name="timing">
<double name="time">26.0</double>
<lst name="prepare">
<double name="time">0.0</double>
<lst name="query">
<double name="time">0.0</double>
</lst>
<lst name="facet">
<double name="time">0.0</double>
</lst>
<lst name="facet_module">
<double name="time">0.0</double>
</lst>
<lst name="mlt">
<double name="time">0.0</double>
</lst>
<lst name="highlight">
<double name="time">0.0</double>
</lst>
<lst name="stats">
<double name="time">0.0</double>
</lst>
<lst name="expand">
<double name="time">0.0</double>
</lst>
<lst name="debug">
<double name="time">0.0</double>
</lst>
</lst>
<lst name="process">
<double name="time">25.0</double>
<lst name="query">
<double name="time">0.0</double>
</lst>
<lst name="facet">
<double name="time">0.0</double>
</lst>
<lst name="facet_module">
<double name="time">0.0</double>
</lst>
<lst name="mlt">
<double name="time">0.0</double>
</lst>
<lst name="highlight">
<double name="time">0.0</double>
</lst>
<lst name="stats">
<double name="time">0.0</double>
</lst>
<lst name="expand">
<double name="time">0.0</double>
</lst>
<lst name="debug">
<double name="time">24.0</double>
</lst>
</lst>
</lst>
</lst>
</response>
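The explain output above follows Lucene's default BM25 similarity: each per-field score is boost × idf × tfNorm, and the document score is the max over the qf fields. The numbers can be reproduced with a small sketch (values copied from the explain entries above):

```python
def tf_norm(tf, field_len, avg_field_len, k1=1.2, b=0.75):
    """BM25 term-frequency normalisation, as shown in the explain output."""
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * field_len / avg_field_len))

# Explain entry for doc ACCE9GQ3BMCHZXYY: only name:mobile matched.
boost = 0.2
idf = 4.5229306
norm = tf_norm(tf=1.0, field_len=7.111111, avg_field_len=8.496009)
score = boost * idf * norm
print(round(score, 5))  # ~0.96922, matching the 0.9692176 in the debug output

# Explain entry for doc ACCE8Y3U5HFZVPKE: manu:mobile, norms omitted so tfNorm=1.
manu_score = 10.0 * 6.216876 * 1.0
print(manu_score)  # 62.16876
```

This makes the ranking visible: a manu or category hit with boost 10.0 scores roughly 64 times higher than a name-only hit with boost 0.2, so name-only matches should sort below them.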
As identified, before optimising the query itself, multiple changes were required in schema.xml. This link contains detailed information on what I was trying to achieve:
http://apprize.info/php/solr_4/6.html
Edit 2:
I was searching for mobile, whereas the category for mobiles was defined as mobilephones in the Solr data. So I changed the manufacturer and category fields in schema.xml as follows:
<!--<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="false"/>-->
<field name="manu" type="text_synonyms_brand" indexed="true" stored="true" omitNorms="false"/>
<!--<field name="category" type="text_general" indexed="true" stored="true" omitNorms="false"/>-->
<field name="category" type="text_synonyms_category" indexed="true" stored="true" omitNorms="false"/>
and configured respective field-types as follows:
<fieldType name="text_synonyms_brand" class="solr.TextField" positionIncrementGap="100">
<!--<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>-->
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" splitOnNumerics="0" catenateWords="0" catenateNumbers="0" catenateAll="1" stemEnglishPossessive="0" splitOnCaseChange="0"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_brand.txt" ignoreCase="true" expand="true"/> <!-- tokenizerFactory="solr.KeywordTokenizerFactory"/> -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
</fieldType>
<fieldType name="text_synonyms_category" class="solr.TextField" positionIncrementGap="100">
<!--<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>-->
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" splitOnNumerics="0" catenateWords="0" catenateNumbers="0" catenateAll="1" stemEnglishPossessive="0" splitOnCaseChange="0"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_category.txt" ignoreCase="true" expand="true"/> <!-- tokenizerFactory="solr.KeywordTokenizerFactory"/>-->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
</fieldType>
Then in the synonyms*.txt files I added a few synonyms:
# cat synonyms_category.txt
Iphone=>mobilephones
mobile=>mobilephones
phone=>mobilephones
galaxy=>mobilephones
# cat synonyms_brand.txt
Iphone=>apple
galaxy=>samsung
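Conceptually, the query-time SynonymFilter rewrites the user's term before it is matched against the index. A simplified simulation of the explicit mappings above (not Solr's actual filter chain, just the idea):

```python
# Mappings from the synonyms_*.txt files; ignoreCase=true, so keys are lowercased.
category_synonyms = {
    "iphone": "mobilephones",
    "mobile": "mobilephones",
    "phone": "mobilephones",
    "galaxy": "mobilephones",
}
brand_synonyms = {"iphone": "apple", "galaxy": "samsung"}

def rewrite(term, mapping):
    """Mimic an explicit-mapping rule (a=>b): replace the term if it is mapped."""
    t = term.lower()  # the LowerCaseFilter runs before the synonym filter
    return mapping.get(t, t)

print(rewrite("mobile", category_synonyms))  # mobilephones
print(rewrite("Iphone", brand_synonyms))     # apple
print(rewrite("nokia", brand_synonyms))      # nokia (no mapping, unchanged)
```

So a query for mobile is rewritten to mobilephones before hitting the category field, which is why the category matches now surface.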
After restarting Solr (no reindex was needed, since I commented out the index-time analyzer and only changed query-time analysis), I'm getting the correct results.
A few examples:
for mobile:
Query:
http://<solr-url>/solr/<core>/select?debugQuery=on&indent=on&q=(_query_:"{!edismax mm='100%25' qf='manu^2.0' v=$q1}") OR (_query_:"{!edismax mm='100%25' qf='category^2.0' v=$q1}")&q1=mobile&wt=csv
Result:
mrp_d,manu,description,discount_i,store,_version_,price,categorypath,imageurl,name,id,category,producturl,last_modified
782.0,OEM,,55,Lazada,1549960720168255490,350.0,Mobiles & Tablets|Mobile Phones,http://srv-live.lazada.com.my/p/image-8341802-1-product.jpg,Doogee Voyager2 DG310 5inch Smartphone Android 4.4 Quad Core MT6582 1.3GHZ 3G 1-8GB,LZDOE702ELBSM1QANMY-2279016,Mobile Phones,http://ho.lazada.com.my/SHHSoo?url=http%3A%2F%2Fwww.lazada.com.my%2Fdoogee-voyager2-dg310-5inch-smartphone-android-44-quad-core-mt6582-13ghz-3g-1-8gb-2081438.html%3Foffer_id%3D%7Boffer_id%7D%26affiliate_id%3D%7Baffiliate_id%7D%26offer_name%3D%7Boffer_name%7D_%7Boffer_file_id%7D%26affiliate_name%3D%7Baffiliate_name%7D%26transaction_id%3D%7Btransaction_id%7D&aff_sub=&aff_sub2=&aff_sub3=&aff_sub4=&aff_sub5=,2016-11-03T07:21:57Z
39.0,OEM,,50,Lazada,1549960721000824834,19.4,Mobiles & Tablets|Mobile Phones,http://srv-live.lazada.com.my/p/image-6515112-1-product.jpg,Retro Phone Handset for iPhone 4/4S,LZDOE702ELBTC2CANMY-2320607,Mobile Phones,http://ho.lazada.com.my/SHHSoo?url=http%3A%2F%2Fwww.lazada.com.my%2Fretro-phone-handset-for-iphone-44s-2115156.html%3Foffer_id%3D%7Boffer_id%7D%26affiliate_id%3D%7Baffiliate_id%7D%26offer_name%3D%7Boffer_name%7D_%7Boffer_file_id%7D%26affiliate_name%3D%7Baffiliate_name%7D%26transaction_id%3D%7Btransaction_id%7D&aff_sub=&aff_sub2=&aff_sub3=&aff_sub4=&aff_sub5=,2016-11-03T07:21:58Z
for iphone:
Query:
http://<solr-url>/solr/<core>/select?debugQuery=on&indent=on&q=(_query_:"{!edismax mm='100%25' qf='manu^2.0' v=$q1}") AND (_query_:"{!edismax mm='100%25' qf='category^2.0' v=$q1}")&q1=iphone&wt=csv
Result:
mrp_d,manu,description,discount_i,store,_version_,price,categorypath,imageurl,name,id,category,producturl,last_modified
3199.0,Apple,,28,Lazada,1549961206837542912,2299.0,Mobiles & Tablets|Mobile Phones,http://srv-live.lazada.com.my/p/image-9072684-1-product.jpg,Apple iPhone 6s 16GB (Gold) Malaysia Set,LZDAP564ELAA2W839ANMY-5749957,Mobile Phones,http://ho.lazada.com.my/SHHSoo?url=http%3A%2F%2Fwww.lazada.com.my%2Fapple-iphone-6s-16gb-gold-malaysia-set-4862709.html%3Foffer_id%3D%7Boffer_id%7D%26affiliate_id%3D%7Baffiliate_id%7D%26offer_name%3D%7Boffer_name%7D_%7Boffer_file_id%7D%26affiliate_name%3D%7Baffiliate_name%7D%26transaction_id%3D%7Btransaction_id%7D&aff_sub=&aff_sub2=&aff_sub3=&aff_sub4=&aff_sub5=,2016-11-03T07:29:42Z
1499.0,Apple,,52,Lazada,1549961206837542915,705.0,Mobiles & Tablets|Mobile Phones,http://srv-live.lazada.com.my/p/image-0684205-490901cdc6628dd90beefa8e05173cdd-product.jpg,100% New Condition (Refurbished) Apple iPhone 5 16GB (White)[Grade A],LZDAP564ELAA2ZP7GANMY-5976742,Mobile Phones,http://ho.lazada.com.my/SHHSoo?url=http%3A%2F%2Fwww.lazada.com.my%2F100-new-condition-refurbished-apple-iphone-5-16gb-whitegrade-a-5024860.html%3Foffer_id%3D%7Boffer_id%7D%26affiliate_id%3D%7Baffiliate_id%7D%26offer_name%3D%7Boffer_name%7D_%7Boffer_file_id%7D%26affiliate_name%3D%7Baffiliate_name%7D%26transaction_id%3D%7Btransaction_id%7D&aff_sub=&aff_sub2=&aff_sub3=&aff_sub4=&aff_sub5=,2016-11-03T07:29:42Z
of 1160 Late-Type Stars</title> <author> <initial>H</initial> <initial>J</initial> <lastName>Fogh Olsen</lastName> </author> <name>Astron. Astrophys. Suppl. Ser.</name> <volume>2</volume> <pageno>69</pageno> <date> <year>1970</year> </date> <bibcode>1970A&AS....2...69O</bibcode> </journal> </source> <related> <holding role="similar">II/38 : Stars observed photoelectrically by Dickow et al. <xlink:simple href="II/38"/> </holding>Fogh Olsen H.J. 1970, Astron. Astrophys. Suppl. Ser., 2, 69. Fogh Olsen H.J. 1970, Astron. Astrophys., Suppl. Ser., 1, 189.</related> </reference> <keywords parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html"> <keyword xlink:href="Proper_motions.html">Proper motions</keyword> </keywords> <descriptions> <description> <para>Improved proper motions for the 1160 stars contained in the photometric catalog by Dickow et al. (1970) are presented. Most of the proper motions are from the GC, transferred to the system of FK4. For stars not included in the GC, preliminary AGK or SAO proper motions are given. Fogh Olsen (Astron. Astrophys. Suppl. Ser., 1, 189, 1970) describes the method of improvement. The mean errors of the centennial proper motions increase with increasing magnitude. In Right Ascension, these range from 0.0043/cos(dec) for very bright stars to 0.096/cos(dec) for the faintest stars. 
In Declination, the range is from 0.065 to 1.14.</para> </description> <details/> </descriptions> <tableHead> <tableLinks> <tableLink xlink:href="pmlate.dat"> <title>Proper motion data</title> </tableLink> </tableLinks> <fields> <field> <name>No</name> <definition>Number <footnote> <para>Henry Draper or Bonner Durchmusterung number</para> </footnote> </definition> <units>---</units> </field> <field> <name>pmRA</name> <definition>Centennial Proper Motion RA</definition> <units>s/ca</units> </field> <field> <name>pmDE</name> <definition>Centennial Proper Motion Dec</definition> <units>arcsec/ca</units> </field> <field> <name>RV</name> <definition>Radial Velocity</definition> <units>km/s</units> </field> </fields> </tableHead> <history> <ingest> <creator> <lastName>Julie Anne Watko</lastName> <affiliation>ADC</affiliation> </creator> <date> <year>1996</year> <month>Jun</month> <day>03</day> </date> </ingest> </history> <identifier>I_14.xml</identifier> </dataset> <dataset subject="astronomy" xmlns:xlink="http://www.w3.org/XML/XLink/0.9"> <title>Katalog von 3356 Schwachen Sternen fuer das Aequinoktium 1950 +89 degrees</title> <altname type="ADC">1016</altname> <altname type="CDS">I/16</altname> <altname type="brief">Catalog of 3356 Faint Stars, 1950</altname> <reference> <source> <other> <title>Katalog von 3356 Schwachen Sternen fuer das Aequinoktium 1950 +89 degrees</title> <author> <initial>J</initial> <lastName>Larink</lastName> </author> <author> <initial>A</initial> <lastName>Bohrmann</lastName> </author> <author> <initial>H</initial> <lastName>Kox</lastName> </author> <author> <initial>J</initial> <lastName>Groeneveld</lastName> </author> <author> <initial>H</initial> <lastName>Klauder</lastName> </author> <name>Verlag der Sternwarte, Hamburg-Bergedorf</name> <publisher>???</publisher> <city>???</city> <date> <year>1955</year> </date> <bibcode>1955</bibcode> </other> </source> </reference> <keywords
parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html"> <keyword xlink:href="Fundamental_catalog.html">Fundamental catalog</keyword> <keyword xlink:href="Positional_data.html">Positional data</keyword> <keyword xlink:href="Proper_motions.html">Proper motions</keyword> </keywords> <descriptions> <description> <para>This catalog of 3356 faint stars was derived from meridian circle observations at the Bergedorf and Heidelberg Observatories. The positions are given for the equinox 1950 on the FK3 system. The stars are mainly between 8.0 and 10.0 visual magnitude. A few are brighter than 8.0 mag. The lower limit in brightness resulted from the visibility of the stars.</para> </description> <details> <para>All stars were observed at both the Heidelberg and Bergedorf Observatories. Normally, at each observatory, two observations were obtained with the clamp east and two with the clamp west. The mean errors are comparable for the two observatories with no significant systematic difference in the positions between them. The mean errors of the resulting positions should be approximately 0.011s/cos(dec) in right ascension and 0.023" in declination.</para> <para>The proper motions were derived from a comparison of the catalog positions with the positions in the AGK2 and AGK2A with a 19 year baseline and from a comparison of new positions with those in Kuestner 1900 with about a fifty year baseline.</para> <para>The magnitudes were taken from the AGK2. Most spectral types were determined by A. N. Vyssotsky. 
A few are from the Bergedorfer Spektraldurchmusterung.</para> </details> </descriptions> <tableHead> <tableLinks> <tableLink xlink:href="catalog.dat"> <title>The catalog</title> </tableLink> </tableLinks> <fields> <field> <name>ID</name> <definition>Catalog number</definition> <units>---</units> </field> <field> <name>DMz</name> <definition>BD zone</definition> <units>---</units> </field> <field> <name>DMn</name> <definition>BD number</definition> <units>---</units> </field> <field> <name>mag</name> <definition>Photographic magnitude</definition> <units>mag</units> </field> <field> <name>Sp</name> <definition>Spectral class</definition> <units>---</units> </field> <field> <name>RAh</name> <definition>Right Ascension hours (1950)</definition> <units>h</units> </field> <field> <name>RAm</name> <definition>Right Ascension minutes (1950)</definition> <units>min</units> </field> <field> <name>RAs</name> <definition>Right Ascension seconds (1950)</definition> <units>s</units> </field> <field> <name>Pr-RA1</name> <definition>First order precession in RA per century</definition> <units>0.01s/a</units> </field> <field> <name>Pr-RA2</name> <definition>Second order precession in RA per century</definition> <units>0.0001s2/a2</units> </field> <field> <name>pmRA</name> <definition>Proper motion in RA from AGK2 positions</definition> <units>0.01s/a</units> </field> <field> <name>pmRA2</name> <definition>Proper motion in RA from Kuestner positions</definition> <units>0.01s/a</units> </field> <field> <name>DE-</name> <definition>Sign of declination (1950)</definition> <units>---</units> </field> <field> <name>DEd</name> <definition>Declination degrees (1950)</definition> <units>deg</units> </field> <field> <name>DEm</name> <definition>Declination minutes (1950)</definition> <units>arcmin</units> </field> <field> <name>DEs</name> <definition>Declination seconds (1950)</definition> <units>arcsec</units> </field> <field> <name>Pr-de1</name> <definition>First order precession in dec 
per century</definition> <units>arcsec/ha</units> </field> <field> <name>Pr-de2</name> <definition>Second order precession in dec per century</definition> <units>arcsec2/ha2</units> </field> <field> <name>pmdec</name> <definition>Proper motion in DE from AGK2 positions</definition> <units>arcsec/ha</units> </field> <field> <name>pmdec2</name> <definition>Proper motion in DE from Kuestner positions</definition> <units>arcsec/ha</units> </field> <field> <name>epoch</name> <definition>Epoch of observation - 1900.0</definition> <units>yr</units> </field> <field> <name>rem</name> <definition>Note for star in printed catalog <footnote> <para>1 = ma (blend?) 3 = pr (preceding) 4 = seq (following) 5 = bor (northern) 6 = au (southern) * = other note in printed volume (All notes in the printed volume have not been indicated in this version.) the printed volume sometimes has additional information on the systems with numerical remarks.</para> </footnote> </definition> <units>---</units> </field> </fields> </tableHead> <history> <ingest> <creator> <lastName>Nancy Grace Roman</lastName> <affiliation>ADC/SSDOO</affiliation> </creator> <date> <year>1996</year> <month>Feb</month> <day>01</day> </date> </ingest> </history> <identifier>I_16.xml</identifier> </dataset> </datasets>
If you have XML alone in RDD[String] format, you can convert it to DataFrame with Databricks utility class: com.databricks.spark.xml.XmlReader#xmlRdd
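A minimal sketch of that conversion, assuming the spark-xml package is on the classpath and a local SparkSession; the rowTag value and sample fragment are illustrative, and the builder-method names follow the spark-xml README, so check them against your version:

```scala
import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.SparkSession

object XmlRddToDf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("xmlRdd").getOrCreate()

    // An RDD[String] where each element is one XML fragment
    val rdd = spark.sparkContext.parallelize(Seq(
      """<row><ab key="someKey" value="someValue"/></row>"""))

    // XmlReader#xmlRdd parses the fragments into a DataFrame;
    // rowTag must match the fragment's root element (default is "ROW")
    val df = new XmlReader().withRowTag("row").xmlRdd(spark.sqlContext, rdd)
    df.printSchema()
  }
}
```

In newer spark-xml releases the same thing is exposed through `spark.read.format("xml")`, so XmlReader is mainly useful when your data is already an RDD[String].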
You could parse your text file with SGML, using SGML's SHORTREF feature for mixed CSV and Wiki-like syntaxes such as yours. With SHORTREF you declare text tokens to be replaced by other text (typically start- and end-element tags). <!DOCTYPE data [ <!ELEMENT data O O (field+)> <!ELEMENT field O O (#PCDATA|markup)> <!ELEMENT markup O O (row)> <!ELEMENT row - - (ab+)> <!ELEMENT ab - - (#PCDATA)> <!ENTITY start-field "<field>"> <!SHORTREF in-data "^" start-field> <!USEMAP in-data data> <!ENTITY start-markup "<markup>"> <!ENTITY end-markup "</markup>"> <!SHORTREF in-field "`" start-markup> <!USEMAP in-field field> <!SHORTREF in-markup "`" end-markup> <!USEMAP in-markup markup> ]> 1234^12^999^`<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>`^23232 Parsing this with SGML results in the following: <data> <field>1234</field> <field>12</field> <field>999</field> <field> <markup> <row> <ab key="someKey" value="someValue"/> <ab key="someKey1" value="someValue1"/> </row> </markup> </field> <field>23232</field> </data> The SHORTREF and USEMAP declarations tell SGML to treat a caret character as a start-element tag for <field> when in data child content, and to treat a backtick character as a start-element tag for markup when in field child content. When in markup child content, another backtick character ends the markup element. SGML will also infer omitted start- and end-element tags based on the O omission indicators and the content model rules. Edit: to make this work without changing your data file (datafile.csv, say), instead of including the content verbatim in the master SGML file, declare an external entity and place a reference to it like this: <!DOCTYPE data [ <!-- ... same declarations as above ... --> <!ENTITY datafile SYSTEM "datafile.csv"> ]> &datafile; SGML will pull the content of datafile.csv into the datafile entity and replace the &datafile; entity reference with the file content.
I tried parsing the mentioned data without using explode (DataFrame), at the RDD level. Please suggest any improvements. The steps: read the data as a text file and define a schema; split the string using the delimiter ^; filter out bad records which don't conform to the schema; match the data against the schema defined earlier. Now you will have data like below in a tuple, and we are left to parse the middle XML data. (1234,12,999,"<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>",23232) Use xml.attribute("key"), as it will return all the keys. If you need the value someValue and are not interested in someValue1, then loop through this node sequence and apply a filter of contains("key") to eliminate the other keys. I have used the key Duration that was present in the data. Apply an attribute projection on the previous step to get the value. There is a similar question on Cloudera. //define a case class for schema match with data input case class stb (server_unique_id:Int,request_type:Int,event_id:Int,stb_timestamp:String,stb_xml:String,device_id:String,secondary_timestamp: String) val data = spark.read.textFile(args(0)).rdd; //read data from the path supplied on the CLI //check for ^ delimiter and 7 fields, else filter out var clean_Data = data.filter { line => line.trim().contains("^")} .map { line => line.split("\\^")} .filter{ line => line.length == 7} //match the schema and keep rows having event id = 100 and a Duration tag var tup_Map = clean_Data.map{ line => stb (line(0).toInt,line(1).toInt,line(2).toInt,line(3),line(4),line(5),line(6))} .filter(line => (line.event_id == 100 && line.stb_xml.contains("Duration"))); //xml is of name-value format, hence the attributes are all the same (n,v) //parse through the xml structure and find the necessary data //xmlnv will parse the top level to a NodeSeq holding 8 different values like duration, channel in self-closing tags //in name-value format var xml_Map = tup_Map.map{line => var xmld = XML.loadString(line.stb_xml); var xmlnv = xmld \\ "nv"; var duration = 0; 
for { i <- 0 to xmlnv.length-1 if xmlnv(i).attributes.toString().contains("Duration") } duration = (xmlnv(i) \\ "@v").text.toInt; var channelNum = 0; for { i <- 0 to xmlnv.length-1 if xmlnv(i).attributes.toString().contains("ChannelNumber") } channelNum = (xmlnv(i) \\ "@v").text.toInt; var channelType = ""; for { i <- 0 to xmlnv.length-1 if xmlnv(i).attributes.toString().contains("ChannelType") } channelType = (xmlnv(i) \\ "@v").text; (duration, channelNum, channelType, line.device_id) } //persist xml_Map for further operations xml_Map.persist();
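The split-and-validate part of this approach can be exercised without Spark. A minimal plain-Scala sketch; the sample record and the field count of 5 are illustrative, not the real 7-field layout:

```scala
// Hypothetical sample record: caret-delimited fields, the 4th carrying an XML fragment.
val sample = """1234^12^999^<row><ab key="someKey" value="someValue"/></row>^23232"""

// Split on the caret delimiter; the embedded XML contains no carets, so a plain split is safe.
def fields(line: String): Array[String] = line.split("\\^")

// Keep only records that contain the delimiter and match the expected field count.
def isClean(line: String, expected: Int): Boolean =
  line.trim.contains("^") && fields(line).length == expected
```

The same two functions are what the `filter`/`map`/`filter` chain above applies per line, just lifted out so they can be unit-tested on their own.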
Solr multifield query sorting incorrectly
I am trying to customize a multifield score on Solr. <field name="cat" type="string" indexed="true" stored="true" required="true" multiValued="false" /> <field name="name" type="text_en_splitting" indexed="true" stored="true" required="true" /> <field name="tags" type="text_general" indexed="true" stored="true" required="true" multiValued="true" /> <field name="popularity" type="sint" indexed="true" stored="true" required="true" /> <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> where sint is a sortable int. I want to rank by query match on 'name', then 'popularity', then query match on 'tags'. Here is my SolrJ code: HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr"); SolrQuery query = new SolrQuery(); query.setQuery("National"); query.setFields("cat name tags popularity id score"); query.setStart(0); //boost query.set("defType", "edismax"); query.set("qf", "name^999.0 tags^2.0 popularity^-1000.0"); QueryResponse response = solr.query(query); SolrDocumentList results = response.getResults(); for (int i = 0; i < results.size(); ++i) { System.out.println(results.get(i)); } When I test, all my docs get the same score even though they have different tags and popularity. I do not want to use copyField types, and setAddField only sorts based on one field... This seems like a very common issue, but I have scoured the whole web and haven't found a solution to my problem. Thanks in advance for the responses! :)
First, to understand how the dismax score is computed, see: http://searchhub.org/2010/05/23/whats-a-dismax/ where it says: "score is determined by the maximum score of its subclauses, instead of the sum like a BooleanQuery". The individual scores are not added; the result is most likely dominated by the name match (name^999.0), which is why you are seeing the same score for all results. Also, I am not sure what your intent with popularity^-1000.0 is; I guess you want to boost the results by popularity. You can use the standard query parser and try something like this: q={!boost b=popularity}name:National^999.0 OR tags:National^2.0
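If the intent is "rank by name match, then use popularity as a secondary signal", a common edismax alternative is to keep only text fields in qf and feed popularity into a boost function instead. The parameter names below follow the standard edismax documentation; the query values are illustrative:

```text
# additive boost: popularity is added to the text relevance score
q=National&defType=edismax&qf=name^999.0 tags^2.0&bf=popularity

# multiplicative boost: the text relevance score is multiplied by popularity
q=National&defType=edismax&qf=name^999.0 tags^2.0&boost=popularity
```

The multiplicative `boost` tends to be easier to reason about than large additive weights, since it scales with the underlying relevance score instead of drowning it out.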
Use termfreq(field,term) function for phrase with space in SOLR 4.1
I am using the termfreq(field,term) SOLR function. This works: ?fl=product_name,termfreq(product_name,"iphon")&q=iphone 4s //Found freq But the problem is with a term like "iphone 4s" containing a space: ?fl=product_name,termfreq(product_name,"iphon 4s")&q=iphone 4s //Returns 0 freq It returns 0 freq although that term (phrase) exists in the doc. So, the question is: can I use the termfreq() function with a full phrase like "iphone 4s", and how? I am using SOLR 4.1, and the analyzer for the field is <fieldType name="text_ws" class="solr.TextField"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> The field is <field name="product_name" type="text_ws" indexed="true" stored="true"/>
As you are using a WhitespaceTokenizerFactory, the term iphone 4s does not exist as a single term in the index, and termfreq works on indexed terms, not phrases. You could use KeywordTokenizerFactory for indexing, which doesn't tokenize the input, so the whole phrase is available as one term. Alternatively, you can look at shingle options (ShingleFilterFactory), which group adjacent words into compound tokens for you.
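For reference, a field type along these lines (standard Solr factories; the type name is illustrative) keeps the whole input as a single lowercased token, so termfreq(product_name,"iphone 4s") can then match the full phrase:

```xml
<fieldType name="text_keyword" class="solr.TextField">
  <analyzer>
    <!-- emits the entire input as one token instead of splitting on whitespace -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that termfreq does not analyze its argument, so the term you pass in must match the indexed token exactly (here: the whole phrase, lowercased). Documents indexed with the old field type must be reindexed for this to take effect.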
Pentaho Data Integration Mapping
I am using Pentaho Data Integration. I created a new transformation and I have 2 steps in it: one is a CSV file of my data; the second is an Excel file with two columns, one the state names and the other the short form of that state name, for example ("New York", "NY"). In my CSV file I have a state column with the state names, such as "New York". I want to use my Excel file to map "New York" to "NY". I have googled this all day with no clear answer... can anyone help?
You can use Merge Join. Using this you can merge both files and select the desired columns. Before merging, you have to sort both inputs on the fields you are using for the join key. In your case, that will be the state name.
I would recommend using a Stream Lookup step for this task. Check the test transformation attached; it will do your task. <?xml version="1.0" encoding="UTF-8"?> <transformation-steps> <steps> <step> <name>EXCEL</name> <type>DataGrid</type> <description/> <distribute>Y</distribute> <custom_distribution/> <copies>1</copies> <partitioning> <method>none</method> <schema_name/> </partitioning> <fields> <field> <name>State</name> <type>String</type> <format/> <currency/> <decimal/> <group/> <length>-1</length> <precision>-1</precision> <set_empty_string>N</set_empty_string> </field> <field> <name>Short_state</name> <type>String</type> <format/> <currency/> <decimal/> <group/> <length>-1</length> <precision>-1</precision> <set_empty_string>N</set_empty_string> </field> </fields> <data> <line> <item>New York</item><item>NY</item> </line> <line> <item>Texas</item><item>TX</item> </line> </data> <cluster_schema/> <remotesteps> <input> </input> <output> </output> </remotesteps> <GUI> <xloc>392</xloc> <yloc>80</yloc> <draw>Y</draw> </GUI> </step> <step> <name>CSV</name> <type>DataGrid</type> <description/> <distribute>Y</distribute> <custom_distribution/> <copies>1</copies> <partitioning> <method>none</method> <schema_name/> </partitioning> <fields> <field> <name>Full_state_name</name> <type>String</type> <format/> <currency/> <decimal/> <group/> <length>-1</length> <precision>-1</precision> <set_empty_string>N</set_empty_string> </field> </fields> <data> <line> <item>New York</item> </line> <line> <item>Texas</item> </line> </data> <cluster_schema/> <remotesteps> <input> </input> <output> </output> </remotesteps> <GUI> <xloc>511</xloc> <yloc>169</yloc> <draw>Y</draw> </GUI> </step> <step> <name>Stream lookup</name> <type>StreamLookup</type> <description/> <distribute>Y</distribute> <custom_distribution/> <copies>1</copies> <partitioning> <method>none</method> <schema_name/> </partitioning> <from>EXCEL</from> <input_sorted>N</input_sorted> <preserve_memory>Y</preserve_memory> 
<sorted_list>N</sorted_list> <integer_pair>N</integer_pair> <lookup> <key> <name>Full_state_name</name> <field>State</field> </key> <value> <name>State</name> <rename>State</rename> <default/> <type>String</type> </value> <value> <name>Short_state</name> <rename>Short_state</rename> <default/> <type>String</type> </value> </lookup> <cluster_schema/> <remotesteps> <input> </input> <output> </output> </remotesteps> <GUI> <xloc>510</xloc> <yloc>79</yloc> <draw>Y</draw> </GUI> </step> </steps> <order> <hop> <from>EXCEL</from><to>Stream lookup</to><enabled>Y</enabled> </hop> <hop> <from>CSV</from><to>Stream lookup</to><enabled>Y</enabled> </hop> </order> <notepads> </notepads> <step_error_handling> </step_error_handling> </transformation-steps>