How to parse a form with Apache Tika/Tesseract?

I have a scanned three-page document. It has some structure to it (N fields; some are handwritten text, some are numbers). I know the name of each field. How do I get the field values?
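No answer is recorded for this question, but one workable approach is: rasterize each page to an image, run Tesseract on it (e.g. via the pytesseract package), and then search the OCR text for the known field labels. The sketch below assumes a "Label: value" layout on the form, which is an assumption about your document; the file names, the regex, and the extractor function are illustrative only, and the pytesseract call is shown commented out since it needs Tesseract installed.

```python
import re

def extract_fields(ocr_text, field_names):
    """Given raw OCR text and the known field labels, return {label: value}.

    Assumes each field appears as 'Label: value' (or 'Label value') on one
    line -- adjust the pattern to match your form's actual layout.
    """
    values = {}
    for name in field_names:
        # match the label, an optional colon, then capture the rest of the line
        m = re.search(rf"{re.escape(name)}\s*:?\s*(.+)", ocr_text, re.IGNORECASE)
        values[name] = m.group(1).strip() if m else None
    return values

# The OCR step itself (requires tesseract plus pytesseract and Pillow):
# import pytesseract
# from PIL import Image
# ocr_text = "\n".join(pytesseract.image_to_string(Image.open(p))
#                      for p in ["page1.png", "page2.png", "page3.png"])

sample = "Name: John Smith\nAccount No: 12345\nAmount 99.50"
print(extract_fields(sample, ["Name", "Account No", "Amount"]))
# → {'Name': 'John Smith', 'Account No': '12345', 'Amount': '99.50'}
```

Handwritten fields are where this gets fragile: Tesseract is trained on printed text, so expect to need per-field cropping and possibly a handwriting-specific model for those regions.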

Related

Power query nested table

I have an XML file that I need to process. My output looks like this:
In each nested table there is only one value, but I can't figure out how to ungroup them.
Add a column and insert the following M expression (replacing channel.item.ht with whatever your column name actually is):
if [channel.item.ht] is table then Record.ToList([channel.item.ht]{0}){0} else [channel.item.ht]

Convert a hive extract file into mainframe file layout

I have generated a Hive extract. For instance, it has the columns below:
fields --> a1, a2, a3, b, c, d, e1, e2, f1, f2
I need to combine the a1, a2, a3 fields into a single field 'a'.
Once they are combined, I have to take each record and apply vector elements for some fields when the data is migrated to the mainframe. Since vector fields are not available in Hive, we create the source table with a separate column for each vector occurrence, e.g. e1, e2 and f1, f2.
For example, this is the format I need:
record
ebcdic string e;
ebcdic string f;
end [2]
Now what I need to do is write a Hive query to transform the normal Hive file layout into the format above. Since I am not familiar with this, can anyone suggest some logic to solve it?
Thanks in advance.
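No answer is recorded here, but the row-level transformation being asked for can be made concrete. In Hive it would look something like SELECT concat(a1, a2, a3) AS a, b, c, d, array(e1, e2) AS e, array(f1, f2) AS f FROM src (the table name src is a placeholder). The same logic, sketched in Python so it can be checked against a sample record before writing the query (column names are taken from the question):

```python
def to_mainframe_layout(row):
    """Combine a1..a3 into one field 'a' and fold the per-occurrence
    vector columns (e1,e2 / f1,f2) into lists -- mirroring what a Hive
    query using concat() and array() would produce."""
    return {
        "a": row["a1"] + row["a2"] + row["a3"],
        "b": row["b"],
        "c": row["c"],
        "d": row["d"],
        "e": [row["e1"], row["e2"]],  # vector occurrences of e
        "f": [row["f1"], row["f2"]],  # vector occurrences of f
    }

row = {"a1": "AB", "a2": "CD", "a3": "EF", "b": "x", "c": "y",
       "d": "z", "e1": "e-one", "e2": "e-two", "f1": "f-one", "f2": "f-two"}
print(to_mainframe_layout(row)["a"])  # → ABCDEF
```

The EBCDIC encoding itself would happen downstream in the mainframe transfer step, not in Hive; the query only has to produce the combined and vectorized layout.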

solr query to sort result in descending order on basis of price

I am a complete beginner with Solr and I am trying to query my data. I want to find documents with name=plant and sort them by price, highest first.
In my schema both name and price are of text type.
For example, say the data is:
name:abc, price:25;
name:plant, price:35;
name:plant, price:45; // 1000 other rows
My approach:
/query?q=(name:"Plant")&stopwords=true
The above gives me the plant results, but I am not sure how to sort them using the price field.
Any help will be appreciated.
You can use the sort parameter to achieve the sorting.
Your query would be q=(name:"Plant")&sort=price desc
Note that because your price field is of text type it will sort lexicographically ("45" comes before "9"); for a true numeric sort, price should use a numeric field type.
The sort parameter arranges search results in either ascending (asc) or descending (desc) order. The parameter can be used with either numerical or alphabetical content, and the direction can be entered in all-lowercase or all-uppercase letters (i.e., both asc and ASC work).
Solr can sort query responses according to document scores or the value of any field with a single value that is either indexed or uses DocValues (that is, any field whose attributes in the schema include multiValued="false" and either docValues="true" or indexed="true" – if the field does not have DocValues enabled, the indexed terms are used to build them on the fly at runtime), provided that:
the field is non-tokenized (that is, the field has no analyzer, so its contents are not split into tokens – tokenization would make the sorting inconsistent), or
the field uses an analyzer (such as the KeywordTokenizer) that produces only a single term.
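As a concrete illustration, here is how the sorted request could be assembled from Python; the host, port and collection name (products) are placeholders for your own setup, and /select is the default request handler. Any HTTP client can then fetch the URL.

```python
from urllib.parse import urlencode

# Build the Solr request with a sort clause.
params = {
    "q": 'name:"plant"',   # field query; quoting keeps the phrase exact
    "sort": "price desc",  # highest price first
    "rows": 10,            # page size
}
url = "http://localhost:8983/solr/products/select?" + urlencode(params)
print(url)
# → http://localhost:8983/solr/products/select?q=name%3A%22plant%22&sort=price+desc&rows=10
```

urlencode takes care of escaping the colon and quotes in the q parameter, which is easy to get wrong when building the URL by hand.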

Cassandra 2.2.11 add new map column from text column

Let's say I have a table with two columns:
primary key: id, of type varchar
and non-primary-key: data, of type text
The data column consists only of JSON values, for example:
{
"name":"John",
"age":30
}
I know that I can't alter this column to a map type, but maybe I can add a new map column with the values from the data column, or maybe you have some other idea?
What can I do about it? I want a map column in this table holding the values from data.
You might want to make use of the CQL COPY command to export all your data to a CSV file.
Then alter your table and create a new column of type map.
Convert the exported data to another file containing UPDATE statements where you only update the newly created column with values converted from JSON to a map. For conversion use a tool or language of your choice (be it bash, python, perl or whatever).
BTW, be aware that with a map you specify the data type of the map's key and the data type of its value. So you will most probably be limited to strings if you want to stay generic, i.e. a map<text, text>. Consider whether this is appropriate for your use case.
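To make the conversion step concrete, here is a minimal Python sketch that turns exported (id, data) rows into UPDATE statements for a new map<text, text> column. The table name my_table and column name data_map are placeholders chosen for this example, and real data would also need proper CQL quote escaping (single quotes inside values must be doubled), which is omitted here.

```python
import json

def make_updates(rows):
    """rows: iterable of (id, json_text) pairs, as exported by COPY.

    Yields one CQL UPDATE per row, filling the new map<text, text>
    column from the parsed JSON. All values are rendered as text.
    """
    for id_, data in rows:
        m = json.loads(data)
        pairs = ", ".join(f"'{k}': '{v}'" for k, v in m.items())
        yield f"UPDATE my_table SET data_map = {{{pairs}}} WHERE id = '{id_}';"

print(next(make_updates([("42", '{"name": "John", "age": 30}')])))
# → UPDATE my_table SET data_map = {'name': 'John', 'age': '30'} WHERE id = '42';
```

The generated statements can then be replayed with cqlsh -f or the SOURCE command. Note how the integer 30 becomes the string '30', which is the map<text, text> genericity trade-off mentioned above.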

cassandra + pig with wide columns

I am currently working on a recommender application, using Cassandra with Hadoop and Pig for map/reduce jobs.
To take advantage of column-name properties, our team has decided to store the data using valueless columns and aggregate column names. For example, all hits for a specific piece of content are stored in a column family with a single row, and each column is one hit for that content, using the following structure:
rowkey = 'single_row' {
id_content:hit_date, -
.
.
.
}
With this schema we get wide rows instead of skinny ones; the question is, how do I need to manipulate the data in Pig in order to store it in Cassandra with this schema?
I'm not sure from your comment if you're using composite columns, or whether you're just concatenating id_content and hit_date.
For normal (i.e. non-composite) columns, the schema is:
(key, {(col_name, col_value), ...})
In the case of composite columns, I believe the schema is the following:
(key, {((col_name_part_1, col_name_part_2), col_value), ...})
This assessment (for composite columns) is based on reading the patch submitted on https://issues.apache.org/jira/browse/CASSANDRA-3684
