How to resolve the error at _name = "label" if "label" in features[0].keys() else "labels" in Hugging Face NER - python-3.x

I am trying to run custom NER on my data using offset values, following this guide: https://huggingface.co/course/chapter7/2
I keep getting an error at the line
_name = "label" if "label" in features[0].keys() else "labels"
DATA BEFORE the tokenize_and_align_labels FUNCTION
{'texts': ['WASHINGTON USA WA DRIVER LICENSE BESSETTE Lamma 4d DL 73235766 9 Class AM to Iss 22/03/2021 Ab Exp 07130/2021 DOB 2/28/21 1 BESSETTE 2 GERALD 8 6930 NE Grandview Blvd, keyport, WA 86494 073076 12 Restrictions A 9a End P 16 Hgt 5\'-04" 15 Sex F 18 Eyes BLU 5 DD 73235766900000000000 Gerald Bessette', ] }
'tag_names': [
[
{'start': 281, 'end': 296, 'tag': 'PERSON_NAME', 'text': 'Gerald Bessette'},
{'start': 135, 'end': 141, 'tag': 'FIRST_NAME', 'text': 'GERALD'},
{'start': 124, 'end': 122, 'tag': 'LAST_NAME', 'text': 'BESSETTE'},
{'start': 81, 'end': 81, 'tag': 'ISSUE_DATE', 'text': '22/03/2021'},
{'start': 99, 'end': 109, 'tag': 'EXPIRY_DATE', 'text': '07130/2021'},
{'start': 114, 'end': 121, 'tag': 'DATE_OF_BIRTH', 'text': '2/28/21'},
{'start': 51, 'end': 59, 'tag': 'DRIVER_LICENSE_NUMBER', 'text': '73235766'},
{'start': 144, 'end': 185, 'tag': 'ADDRESS', 'text': '6930 NE Grandview Blvd, keyport, WA 86494'}
],
DATA AFTER the tokenize_and_align_labels FUNCTION
{'input_ids':
[[0, 305, 8684, 2805, 9342, 10994, 26994, 42560, 39951, 163, 12147, 3935, 6433, 6887, 1916, 204, 417, 13925, 6521, 1922, 4390, 4280, 361,
4210, 3326, 7, 19285, 820, 73, 3933, 73, 844, 2146, 2060, 12806, 321, 5339, 541, 73, 844, 2146, 14010, 387, 132, 73, 2517, 73, 2146, 112,
163, 12147, 3935, 6433, 132, 272, 39243, 495, 290, 5913, 541, 12462, 2374, 5877, 12543, 6, 762, 3427, 6, 9342, 290, 4027, 6405, 13470, 541,
5067, 316, 40950, 2485, 83, 361, 102, 4680, 221, 545, 289, 19377, 195, 32269, 3387, 113, 379, 15516, 274, 504, 26945, 12413, 791, 195, 27932,
6521, 1922, 4390, 36400, 45947, 151, 14651, 163, 3361, 3398, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
'attention_mask':
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'offset_mapping': [[(0, 0), (0, 1), (1, 10), (11, 14), (15, 17), (18, 20), (20, 24), (25, 28), (28, 32), (33, 34), (34, 37), (37, 39), (39, 41),
(42, 45), (45, 47), (48, 49), (49, 50), (51, 53), (54, 56), (56, 58), (58, 60), (60, 62), (63, 64), (65, 70), (71, 73),
(74, 76), (77, 80), (81, 83), (83, 84), (84, 86), (86, 87), (87, 89), (89, 91), (92, 94), (95, 98), (99, 100), (100, 102),
(102, 104), (104, 105), (105, 107), (107, 109), (110, 112), (112, 113), (114, 115), (115, 116), (116, 118), (118, 119),
(119, 121), (122, 123), (124, 125), (125, 128), (128, 130), (130, 132), (133, 134), (135, 136), (136, 140), (140, 141),
(142, 143), (144, 146), (146, 148), (149, 151), (152, 157), (157, 161), (162, 166), (166, 167), (168, 171), (171, 175),
(175, 176), (177, 179), (180, 181), (181, 183), (183, 185), (186, 188), (188, 190), (190, 192), (193, 195), (196, 204),
(204, 208), (209, 210), (211, 212), (212, 213), (214, 217), (218, 219), (220, 222), (223, 224), (224, 226), (227, 228),
(228, 230), (230, 232), (232, 233), (234, 236), (237, 240), (241, 242), (243, 245), (246, 250), (251, 253), (253, 254),
(255, 256), (257, 259), (260, 262), (262, 264), (264, 266), (266, 269), (269, 277), (277, 280), (281, 287), (288, 289),
(289, 292), (292, 296), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0)]
'labels': [[24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 2, 10, 10, 18, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
24, 24, 3, 11, 11, 11, 11, 19, 24, 24, 1, 9, 9, 9, 17, 24, 24, 24, 24, 24, 24, 4, 12, 20, 24, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
16, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
24, 7, 15, 15, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24],
My Code:
import transformers
from transformers import AutoTokenizer
from transformers import AutoTokenizer,BertModel,BertTokenizer
from transformers import RobertaModel,RobertaConfig,RobertaForTokenClassification
from transformers import TrainingArguments, Trainer
# from transformers.trainer import get_tpu_sampler
from transformers.trainer_pt_utils import get_tpu_sampler
from transformers.data.data_collator import DataCollator, InputDataClass
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification
import torch
from torch.nn import CrossEntropyLoss, MSELoss
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data.dataloader import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data.sampler import RandomSampler
from torchcrf import CRF
import dataclasses
import logging
import warnings
import tqdm
import os
import numpy as np
from typing import List, Union, Dict
os.environ["WANDB_DISABLED"] = "true"
print(transformers.__version__)
import evaluate
metric = evaluate.load("seqeval")
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) #add_prefix_space=True
def isin(a, b):
    return a[1] > b[0] and a[0] < b[1]
def tokenize_and_align_labels(examples, label2id, max_length=256):
    tokenized_inputs = tokenizer(examples["texts"], truncation=True, padding='max_length', max_length=max_length, return_offsets_mapping=True)
    print("tokenization done")
    labels = []
    for i, label_idx_for_single_input in enumerate(tqdm.tqdm(examples["tag_names"])):
        # print(i, label_idx_for_single_input)
        labels_for_single_input = ['O' for _ in range(max_length)]
        # print(labels_for_single_input)
        text_offsets = tokenized_inputs['offset_mapping'][i]
        # print("text_offsets", text_offsets)
        for entity in label_idx_for_single_input:
            # print("entity", entity)
            tag = entity['tag']
            # print("tag", tag)
            tag_offset = [entity['start'], entity['end']]
            # print("tag_offset", tag_offset)
# text_offsets [(0, 0), (0, 1), (1, 10), (11, 14), (15, 17), (18, 20), (20, 24), (25, 28), (28, 32), (33, 34), (34, 37), (37, 39), (39, 41), (42, 45), (45, 47), (48, 49), (49, 50), (51, 53), (54, 56), (56, 58), (58, 60), (60, 62), (63, 64), (65, 70), (71, 73), (74, 76), (77, 80), (81, 83), (83, 84), (84, 86), (86, 87), (87, 89), (89, 91), (92, 94), (95, 98), (99, 100), (100, 102), (102, 104), (104, 105), (105, 107), (107, 109), (110, 112), (112, 113), (114, 115), (115, 116), (116, 118), (118, 119), (119, 121), (122, 123), (124, 125), (125, 128), (128, 130), (130, 132), (133, 134), (135, 136), (136, 140), (140, 141), (142, 143), (144, 146), (146, 148), (149, 151), (152, 157), (157, 161), (162, 166), (166, 167), (168, 171), (171, 175), (175, 176), (177, 179), (180, 181), (181, 183), (183, 185), (186, 188), (188, 190), (190, 192), (193, 195), (196, 204), (204, 208), (209, 210), (211, 212), (212, 213), (214, 217), (218, 219), (220, 222), (223, 224), (224, 226), (227, 228), (228, 230), (230, 232), (232, 233), (234, 236), (237, 240), (241, 242), (243, 245), (246, 250), (251, 253), (253, 254), (255, 256), (257, 259), (260, 262), (262, 264), (264, 266), (266, 269), (269, 277), (277, 280), (281, 287), (288, 289), (289, 292), (292, 296), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0)]
# entity {'start': 281, 'end': 296, 'tag': 'PERSON_NAME', 'text': 'Gerald Bessette'}
# tag PERSON_NAME
# tag_offset [281, 296]
            affected_token_ids = [j for j in range(max_length) if isin(tag_offset, text_offsets[j])]
            # print("affected_token_ids", affected_token_ids)
            if len(affected_token_ids) < 1:
                # print('affected_token_ids < 1')
                continue
            if any(labels_for_single_input[j] != 'O' for j in affected_token_ids):
                # print('entity overlap! skipping')
                continue
            for j in affected_token_ids:
                labels_for_single_input[j] = 'I_' + tag
            labels_for_single_input[affected_token_ids[-1]] = 'L_' + tag
            labels_for_single_input[affected_token_ids[0]] = 'B_' + tag
        label_ids = [label2id[x] for x in labels_for_single_input]
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    # print(tokenized_inputs.keys())
    return tokenized_inputs
import json
data = []
with open('data.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))
l = []
for k, v in data[0].items():
    l.append({'text': k, 'spans': v})
train_set = [
    [
        x['text'],
        [{'start': y["start"], 'end': y["end"], 'tag': y["label"], 'text': y["ngram"]} for y in x['spans']]
    ] for x in l
]
## count labels in dataset
from collections import Counter
e = []
for x in train_set:
    for y in x[1]:
        e.append(y['tag'])
Counter(e).most_common()
## get label list
ori_label_list = []
for line in train_set:
    ori_label_list += [entity['tag'] for entity in line[1]]
ori_label_list = sorted(list(set(ori_label_list)))
label_list = []
for prefix in 'BIL':
    label_list += [prefix + '_' + x for x in ori_label_list]
label_list += ['O']
label_list = sorted(list(set(label_list)))
print(label_list)
print(len(label_list))
label2id = {n:i for i,n in enumerate(label_list)}
id2label= {str(i):n for i,n in enumerate(label_list)}
# id2label = {str(i): label for i, label in enumerate(label_names)}
# label2id = {v: k for k, v in id2label.items()}
train_examples ={'texts':[x[0] for x in train_set],'tag_names':[x[1] for x in train_set]}
train_examples = tokenize_and_align_labels(train_examples,label2id)
# train_examples = train_examples.map(tokenize_and_align_labels(label2id),batched=True)
print("here")
print(train_examples.keys())
print(len(train_examples['labels']))
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'labels'])
# 775
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# collator=data_collator(train_examples)
# def compute_metrics(eval_preds):
# logits, labels = eval_preds
# predictions = np.argmax(logits, axis=-1)
#
# # Remove ignored index (special tokens) and convert to labels
# true_labels = [[label_list[l] for l in label if l != -100] for label in labels]
# true_predictions = [
# [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
# for prediction, label in zip(predictions, labels)
# ]
# all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
# return {
# "precision": all_metrics["overall_precision"],
# "recall": all_metrics["overall_recall"],
# "f1": all_metrics["overall_f1"],
# "accuracy": all_metrics["overall_accuracy"],
# }
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint,id2label=id2label,label2id=label2id,)
print(model.config.num_labels)
args = TrainingArguments(
"bert-finetuned-ner",
# evaluation_strategy="epoch",
save_strategy="epoch",
learning_rate=2e-5,
num_train_epochs=2,
weight_decay=0.01,
# push_to_hub=True,
)
trainer = Trainer(
model=model,
args=args,
train_dataset=train_examples,
# eval_dataset=train_examples,
data_collator=data_collator,
# compute_metrics=compute_metrics,
tokenizer=tokenizer)
trainer.train()
ERROR
_name = "label" if "label" in features[0].keys() else "labels"
AttributeError: 'tokenizers.Encoding' object has no attribute 'keys'

I think the object tokenized_inputs that you create and return in tokenize_and_align_labels is a BatchEncoding, not a dict or Dataset object (check this by printing type(myobject) when in doubt). When the Trainer indexes into it to build a batch, features[0] is a tokenizers.Encoding object, which has no keys(), hence the error.
You should apply your tokenizer to your examples using the map function of a datasets.Dataset, as in the example from the documentation.
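A minimal sketch of that approach (assuming train_examples is still the raw {'texts': ..., 'tag_names': ...} dict built above; dropping offset_mapping before training is also an assumption, so the collator only sees model inputs and labels):
from datasets import Dataset

# Build a datasets.Dataset from the raw dict, then tokenize/align with .map
raw_dataset = Dataset.from_dict(train_examples)
tokenized_dataset = raw_dataset.map(
    lambda batch: tokenize_and_align_labels(batch, label2id),
    batched=True,
    remove_columns=["texts", "tag_names"],
)
# offset_mapping was only needed while aligning labels
tokenized_dataset = tokenized_dataset.remove_columns(["offset_mapping"])

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
Each element of tokenized_dataset is then a plain dict, so the features[0].keys() call inside DataCollatorForTokenClassification works.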

Related

Conditional extraction of rows from 2D NUMPY array

I have a 2D numpy array my_data and I want to extract row i if the 4th column of row i has a 0 and the 4th column of row i-1 has a 1.
I do this now using list comprehension but is there a "non-list" faster way to do this?
'''
my_data = array([(65535, 255, 0, 1, 1), (65535, 255, 0, 1, 1),
(65535, 255, 0, 1, 1), ..., (65535, 255, 0, 1, 1),
(65535, 255, 0, 1, 1), (65535, 255, 0, 1, 1)],
dtype=[('col_0', '<u2'), ('col_1', 'u1'), ('col_2', 'u1'), ('col_3', 'u1'), ('col_4', 'u1')])
filtered_data = [my_data[i] for i in range(1,len(my_data)) if (my_data[i][4] == 0 and my_data[i-1][4] == 1)]
'''
I noticed that your array is structured, with named columns,
so you can make some use of these features.
As a source array I used:
my_data = np.array([
(65535, 240, 0, 1, 1),
(65535, 241, 0, 1, 0),
(65535, 242, 0, 1, 1),
(65535, 243, 0, 1, 0),
(65535, 244, 0, 1, 1),
(65535, 245, 0, 1, 1)],
dtype=[('col_0', '<u2'), ('col_1', 'u1'), ('col_2', 'u1'),
('col_3', 'u1'), ('col_4', 'u1')])
The first step is to take the 4-th column and the same column again,
but shifted by 1 position to the right, inserting 0 as the initial
element:
c4 = my_data['col_4']
c4prev = np.insert(c4[:-1], 0, 0)
These two arrays contain, respectively:
array([1, 0, 1, 0, 1, 1], dtype=uint8)
array([0, 1, 0, 1, 0, 1], dtype=uint8)
And to get the expected result use boolean indexing:
result = my_data[np.logical_and(c4 == 0, c4prev == 1)]
The result is:
array([(65535, 241, 0, 1, 0), (65535, 243, 0, 1, 0)],
dtype=[('col_0', '<u2'), ('col_1', 'u1'), ('col_2', 'u1'),
('col_3', 'u1'), ('col_4', 'u1')])
so just the intended rows have been extracted.
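For reuse, the same steps can be wrapped in a small helper (a sketch; the column name 'col_4' is taken from the dtype above):
import numpy as np

def rows_after_transition(data, col='col_4'):
    # Keep row i where data[col][i] == 0 and data[col][i-1] == 1
    c = data[col]
    prev = np.insert(c[:-1], 0, 0)  # shift right by one, pad with 0 so row 0 is never selected
    return data[np.logical_and(c == 0, prev == 1)]

filtered_data = rows_after_transition(my_data)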

Creating a function for my code with datetime and timedelta

I have finished my code and it works when I run it. However, I need to turn this into a function, so that I can call the function with any date string and get the same results. This is my code:
from datetime import datetime, timedelta
dateStr = 'user-input'
dateObj = datetime.strptime(dateStr, '%Y%m%d')
timeStep = timedelta(days=1)
dateObj2 = dateObj + timeStep
days15 = [dateObj + timeStep*i for i in range(15)]
print(days15)
------------------ output:
datetime.datetime(2017, 1, 1, 0, 0),..
I need to be able to pass in
date_str = "20170817"
results = days_15(date_str)
print(results)
And then get the same results. Any hints or help would be appreciated - thank you.
You just need to add a def statement before your code to define the function, and rather than printing the result, return it:
from datetime import datetime, timedelta
def days_15(dateStr):
    dateObj = datetime.strptime(dateStr, '%Y%m%d')
    timeStep = timedelta(days=1)
    return [dateObj + timeStep*i for i in range(15)]
my_date_str = "20170817"
results = days_15(my_date_str)
print(results)
Output:
[
datetime.datetime(2017, 8, 17, 0, 0), datetime.datetime(2017, 8, 18, 0, 0),
datetime.datetime(2017, 8, 19, 0, 0), datetime.datetime(2017, 8, 20, 0, 0),
datetime.datetime(2017, 8, 21, 0, 0), datetime.datetime(2017, 8, 22, 0, 0),
datetime.datetime(2017, 8, 23, 0, 0), datetime.datetime(2017, 8, 24, 0, 0),
datetime.datetime(2017, 8, 25, 0, 0), datetime.datetime(2017, 8, 26, 0, 0),
datetime.datetime(2017, 8, 27, 0, 0), datetime.datetime(2017, 8, 28, 0, 0),
datetime.datetime(2017, 8, 29, 0, 0), datetime.datetime(2017, 8, 30, 0, 0),
datetime.datetime(2017, 8, 31, 0, 0)
]

Merging pickled .npz files in a desired format

I have multiple .npz files which I want to merge into one .npz file with a format similar to "mnist.npz".
The format of mnist.npz is:
((array([[[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]],
[0, 0, 0, ..., 0, 0, 0]]], dtype=uint8),
array([5, 0, 4, ..., 5, 6, 8], dtype=uint8))
Here two arrays are merged into one big npz file.
My two npz arrays are:
x_array:
[[[252, 251, 253],
[151, 150, 152],
[ 28, 25, 27],
...,
[ 30, 25, 27],
[ 30, 25, 27],
[ 32, 27, 29]],
[ 23, 18, 20]],
[[ 50, 92, 163],
[ 55, 90, 163],
[ 75, 105, 176],
...,
[148, 197, 242],
[109, 157, 208],
[109, 165, 222]],
[[ 87, 104, 155],
[ 82, 112, 168],
...,
[ 29, 52, 105],
[ 30, 55, 111],
[ 36, 55, 106]]]
y_array:
[1, 1, 1, 1, 1, 1]
When I tried to merge my files, the output I got is:
(array([[[252, 251, 253],
[151, 150, 152],
[ 28, 25, 27],
...,
[ 30, 25, 27],
[ 30, 25, 27],
[ 32, 27, 29]],
[ 23, 18, 20]]], dtype=uint8), array([[[ 50, 92, 163],
[ 55, 90, 163],
[ 75, 105, 176],
...,
[148, 197, 242],
[109, 157, 208],
[109, 165, 222]],
[ 87, 104, 155],
[ 82, 112, 168],
...,
[ 29, 52, 105],
[ 30, 55, 111],
[ 36, 55, 106]]], dtype=uint8),1, 1, 1, 1, 1, 1)
So in the last line, my array is formatted as
1, 1, 1, 1, 1, 1
instead of something like:
array([1, 1, 1, 1, 1, 1], dtype=uint8)
My code for merging two npz files is:
from numpy import load
import numpy as np

data = load('x_array.npz', allow_pickle=True)
lst = data.files
for item in lst:
    x_train = data[item]
    # print((x_item, x_train))
data1 = load('y_array.npz', allow_pickle=True)
lst1 = data1.files
for item in lst1:
    y_train = data1[item]
out1 = (*x_train, *y_train)
np.savez('out1.npz', out1)
print(out1)
Can anyone please suggest how I can convert my second array of (1, 1, 1, 1, 1, 1) to array([1, 1, 1, 1, 1, 1], dtype=uint8)? Any suggestions are helpful.
After going through my code I found out that by changing the line
out1 = (*x_train, *y_train)
to
out1 = (*x_train, y_train)
the problem went away: without the *, y_train is kept as a single array instead of being unpacked into separate scalar values.
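An alternative sketch that saves each array under an explicit key (the filenames and key names are assumptions), so they come back as separate named arrays like in mnist.npz:
import numpy as np

with np.load('x_array.npz', allow_pickle=True) as d:
    x_train = d[d.files[0]]  # first (assumed only) array stored in the file
with np.load('y_array.npz', allow_pickle=True) as d:
    y_train = d[d.files[0]]

# Save both under named keys; casting to uint8 mirrors the desired output
np.savez('merged.npz',
         x_train=np.asarray(x_train, dtype=np.uint8),
         y_train=np.asarray(y_train, dtype=np.uint8))

with np.load('merged.npz') as m:
    print(m['y_train'])  # array([1, 1, 1, 1, 1, 1], dtype=uint8)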

PyTesseract image_to_data function isn't recognizing my image

I'm using pytesseract to return the coordinates of the objects in an image.
By using this piece of code:
import pytesseract
from pytesseract import Output
import cv2

img = cv2.imread('wine.jpg')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
print(d)
n_boxes = len(d['level'])  # one entry per detected box
for i in range(n_boxes):
    (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imshow('img', img)
cv2.waitKey(0)
I get that:
{'level': [1, 2, 3, 4, 5, 5, 2, 3, 4, 5, 4, 5, 2, 3, 4, 5], 'page_num': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'block_num': [0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3], 'par_num': [0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1], 'line_num': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1], 'word_num': [0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1], 'left': [0, 485, 485, 485, 485, 612, 537, 537, 555, 555, 537, 537, 454, 454, 454, 454], 'top': [0, 323, 323, 323, 323, 324, 400, 400, 400, 400, 426, 426, 0, 0, 0, 0], 'width': [1200, 229, 229, 229, 115, 102, 123, 123, 89, 89, 123, 123, 296, 296, 296, 296], 'height': [900, 29, 29, 29, 28, 28, 40, 40, 15, 15, 14, 14, 892, 892, 892, 892], 'conf': ['-1', '-1', '-1', '-1', 58, 96, '-1', '-1', '-1', 95, '-1', 95, '-1', '-1', '-1', 95], 'text': ['', '', '', '', "JACOB'S", 'CREEK', '', '', '', 'SHIRAZ', '', 'CABERNET', '', '', '', '']}
[image used]
However, when I use this image:
I get that:
{'level': [1, 2, 3, 4, 5], 'page_num': [1, 1, 1, 1, 1], 'block_num': [0, 1, 1, 1, 1], 'par_num': [0, 0, 1, 1, 1], 'line_num': [0, 0, 0, 1, 1], 'word_num': [0, 0, 0, 0, 1], 'left': [0, 0, 0, 0, 0], 'top': [0, 162, 162, 162, 162], 'width': [1200, 0, 0, 0, 0], 'height': [900, 276, 276, 276, 276], 'conf': ['-1', '-1', '-1', '-1', 95], 'text': ['', '', '', '', '']}
Any idea why some images are working and some aren't?
It is mainly caused by differences in quality and contrast; it is much easier for the OCR engine to detect text in the first kind of image.
You can add a few pre-processing routines, including thresholding, blurring, histogram equalization and many other techniques. It is mainly subjective, so I cannot give you guaranteed working code; it is more a matter of trial and error to find the best technique for your case.
UPDATE:
Here is some code that might help you:
def preprocessing_typing_detection(inputImage):
    inputImage = cv2.cvtColor(inputImage, cv2.COLOR_BGR2GRAY)
    inputImage = cv2.Laplacian(inputImage, cv2.CV_8U)
    return inputImage
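For example, the pre-processed image can be fed straight into image_to_data (a sketch; 'wine2.jpg' is a placeholder filename for the problematic image):
img = cv2.imread('wine2.jpg')  # placeholder filename
processed = preprocessing_typing_detection(img)
d = pytesseract.image_to_data(processed, output_type=Output.DICT)
print(d['text'])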

Strange Behaviour when Updating Cassandra row

I am using pyspark and pyspark-cassandra.
I have noticed this behaviour on multiple versions of Cassandra (3.0.x and 3.6.x) using COPY, sstableloader, and now saveToCassandra in pyspark.
I have the following schema
CREATE TABLE test (
id int,
time timestamp,
a int,
b int,
c int,
PRIMARY KEY ((id), time)
) WITH CLUSTERING ORDER BY (time DESC);
and the following data
(1, datetime.datetime(2015, 3, 1, 0, 18, 18, tzinfo=<UTC>), 1, 0, 0)
(1, datetime.datetime(2015, 3, 1, 0, 19, 12, tzinfo=<UTC>), 0, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 22, 59, tzinfo=<UTC>), 1, 0, 0)
(1, datetime.datetime(2015, 3, 1, 0, 23, 52, tzinfo=<UTC>), 0, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 32, 2, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 32, 8, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 43, 30, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 44, 12, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 48, 49, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 49, 7, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 50, 5, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 50, 53, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 51, 53, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 51, 59, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 54, 35, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 55, 28, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 55, 55, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 56, 24, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 11, 14, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 11, 17, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 12, 8, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 12, 10, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 17, 43, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 17, 49, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 12, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 24, tzinfo=<UTC>), 2, 1, 0)
Towards the end of the data, there are two rows which have the same timestamp.
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 1, 2, 0)
It is my understanding that when I save to Cassandra, one of these will "win" - there will only be one row.
After writing to cassandra using
rdd.saveToCassandra(keyspace, table, ['id', 'time', 'a', 'b', 'c'])
Neither row appears to have won. Rather, the rows seem to have "merged".
1 | 2015-03-01 01:17:43+0000 | 1 | 2 | 0
1 | 2015-03-01 01:17:49+0000 | 0 | 3 | 0
1 | 2015-03-01 01:24:12+0000 | 1 | 2 | 0
1 | 2015-03-01 01:24:18+0000 | 2 | 2 | 0
1 | 2015-03-01 01:24:24+0000 | 2 | 1 | 0
Rather than the 2015-03-01 01:24:18+0000 containing (1, 2, 0) or (2, 1, 0), it contains (2, 2, 0).
What is happening here? I can't for the life of me figure out what is causing this behaviour.
This is a little known effect that comes from the batching together of data. Batching writes assigns the same timestamp to all Inserts in the batch. Next, if two writes are done with the exact same timestamp then there is a special merge rule since there was no "last" write. The Spark Cassandra Connector uses intra-partition batches by default so this is very likely to happen if you have this kind of clobbering of values.
The behavior with two identical write timestamps is a merge based on the Greater value.
Given Table (key, a, b)
Batch
Insert "foo", 2, 1
Insert "foo", 1, 2
End batch
The batch gives both mutations the same timestamp. Cassandra cannot choose a "last-written" value since they both happened at the same time; instead it just chooses the greater value of the two. The merged result will be
"foo", 2, 2
