Converting OntoNotes (.gold_conll file format) to DocBins (.spacy) or .jsonl files

TL;DR: How do I convert .gold_conll (as seen in the sample below) to .spacy or .jsonl?
#begin document (bc/cctv/00/cctv_0001); part 000
bc/cctv/00/cctv_0001 0 6 a DT (NP(NP* - - - Speaker#1 * * (ARG1* -
bc/cctv/00/cctv_0001 0 7 special JJ * - - - Speaker#1 * * * -
bc/cctv/00/cctv_0001 0 8 edition NN *) - - - Speaker#1 * * * -
bc/cctv/00/cctv_0001 0 9 of IN (PP* - - - Speaker#1 * * * -
bc/cctv/00/cctv_0001 0 10 Across NNP (NP* - - - Speaker#1 (ORG* * * -
bc/cctv/00/cctv_0001 0 11 China NNP *))))))) - - - Speaker#1 *) *) *) -
#end document
My plan is to work with the OntoNotes data in spaCy and Prodigy. I have retrieved the OntoNotes dataset, but I am struggling to convert it to a format that these tools accept.
I see that the format is similar, but not identical, to two of the sample data formats that can be converted with the python -m spacy convert CLI command. I have tried --converter auto, conllu, and conll. Above is a sample of what my data looks like, and https://data.mendeley.com/datasets/zmycy7t9h9/2 is where I retrieved it from. I have been unable to find the entire dataset anywhere else, and none of the versions at https://huggingface.co/datasets/tner/ontonotes5 are available with 1) all the data and 2) the right format.
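One possible route, if the built-in converters do not accept the file, is to parse the columns by hand and write .jsonl. The following is a minimal sketch, not a tested converter for the full corpus: the function name parse_gold_conll and the constant NER_COL are hypothetical, and the column positions (word in column 3, POS tag in column 4, entity brackets in column 10, counting from 0) are assumptions read off the sample above.

```python
import json

NER_COL = 10  # assumption: 0-based index of the named-entity column in the sample


def parse_gold_conll(lines):
    """Yield one dict per sentence with tokens, POS tags and token-level entity spans."""
    tokens, pos, spans = [], [], []
    label, start = None, None
    for raw in lines:
        line = raw.strip()
        # Blank lines end a sentence; "#begin/#end document" lines end a document.
        if not line or line.startswith("#"):
            if tokens:
                yield {"tokens": tokens, "pos": pos, "spans": spans}
            tokens, pos, spans = [], [], []
            label = None
            continue
        cols = line.split()
        index = len(tokens)
        tokens.append(cols[3])
        pos.append(cols[4])
        ner = cols[NER_COL]
        if ner.startswith("("):  # "(ORG*" opens a span, "(ORG)" covers one token
            label, start = ner.strip("()*"), index
        if ner.endswith(")") and label is not None:
            spans.append({"start": start, "end": index + 1, "label": label})
            label = None
    if tokens:
        yield {"tokens": tokens, "pos": pos, "spans": spans}


SAMPLE = """#begin document (bc/cctv/00/cctv_0001); part 000
bc/cctv/00/cctv_0001 0 6 a DT (NP(NP* - - - Speaker#1 * * (ARG1* -
bc/cctv/00/cctv_0001 0 7 special JJ * - - - Speaker#1 * * * -
bc/cctv/00/cctv_0001 0 8 edition NN *) - - - Speaker#1 * * * -
bc/cctv/00/cctv_0001 0 9 of IN (PP* - - - Speaker#1 * * * -
bc/cctv/00/cctv_0001 0 10 Across NNP (NP* - - - Speaker#1 (ORG* * * -
bc/cctv/00/cctv_0001 0 11 China NNP *))))))) - - - Speaker#1 *) *) *) -
#end document"""

records = list(parse_gold_conll(SAMPLE.splitlines()))
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

The spans here are token offsets with an exclusive end. If a .spacy file is the goal, the same token lists and spans could probably be fed into spaCy v3's Doc(vocab, words=...), Span(doc, start, end, label=...) and DocBin, but that step is not shown or tested here.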

Related

UnicodeEncodeError: 'charmap' codec can't encode character '\u25c4' in position 276445: character maps to <undefined>

In this project, I am trying to utilize the pycaret package to analyze some time series with the help of scikit-learn package. Specifically, I have imported some modules as follows:
from pycaret.regression import (setup, compare_models, predict_model, plot_model, finalize_model, load_model)
# setting up the stage to initialize the training environment
s = setup(
    data=train,
    target=target_var,
    ignore_features=['Series'],
    numeric_features=involved_numerics,
    categorical_features=categorics,
    silent=True,
    log_experiment=True,
)
# Now, to train machine learning models, we need to compare models and find the best one
best_model = compare_models(sort='MAE')
# Making some plots
for id, name in zip(ids, names):
    plot_model(best_model, plot=id, scale=3, save=True)
.
.
.
I was able to run the code for some of the models, but not for all of those listed in the documentation. For some specific models (such as Recursive Feat. Selection), I get the following error message:
Traceback (most recent call last):
  File "c:/Users/username/Desktop/project/project.py", line 55, in <module>
    main()
  File "c:/Users/username/Desktop/project/project.py", line 48, in main
    ml_modelling(data, train, test)
  File "c:\Users\username\Desktop\project\utilities.py", line 1070, in ml_modelling
    plot_model(best_model, plot=id, scale=3, save=True)
  File "C:\Users\username\anaconda3\envs\py38\lib\site-packages\pycaret\regression.py", line 1601, in plot_model
    return pycaret.internal.tabular.plot_model(
  File "C:\Users\username\anaconda3\envs\py38\lib\site-packages\pycaret\internal\tabular.py", line 7712, in plot_model
    ret = locals()[plot]()
  File "C:\Users\username\anaconda3\envs\py38\lib\site-packages\pycaret\internal\tabular.py", line 6293, in residuals_interactive
    resplots.write_html(plot_filename)
  File "C:\Users\username\anaconda3\envs\py38\lib\site-packages\pycaret\internal\plots\residual_plots.py", line 673, in write_html
    f.write(html)
  File "C:\Users\username\anaconda3\envs\py38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25c4' in position 276445: character maps to <undefined>
Here is the training data (train):
   Series  x  y  z     ID  var1  var2  var3  var4  var5  var6
0       1  2  1  3   True    -3    -4     6     7     4     6
1       2  2  1  7  False    22     0     3     5     2     8
2       3  2  1  0   True     3    -6     3     5     4     4
3       4  2  1  4  False    27    -4     8     3    -3     2
.
.
.
I am using VSCode to run my Python tool on a Windows 10 machine, and here is the list of all packages installed in the conda environment:
name: py38
channels:
- conda-forge
- defaults
dependencies:
- bzip2=1.0.8=h8ffe710_4
- ca-certificates=2022.12.7=h5b45459_0
- et_xmlfile=1.1.0=pyhd8ed1ab_0
- libffi=3.4.2=h8ffe710_5
- libsqlite=3.40.0=hcfcfb64_0
- libzlib=1.2.13=hcfcfb64_4
- openpyxl=3.0.10=py38h91455d4_2
- openssl=3.0.7=hcfcfb64_2
- pip=22.3.1=pyhd8ed1ab_0
- python=3.8.15=h4de0772_1_cpython
- python_abi=3.8=3_cp38
- setuptools=66.1.1=pyhd8ed1ab_0
- tk=8.6.12=h8ffe710_0
- ucrt=10.0.22621.0=h57928b3_0
- vc=14.3=hb6edc58_10
- vs2015_runtime=14.34.31931=h4c5c07a_10
- wheel=0.38.4=pyhd8ed1ab_0
- xz=5.2.6=h8d14728_0
- pip:
- alembic==1.9.2
- asttokens==2.2.1
- attrs==22.2.0
- backcall==0.2.0
- blis==0.7.9
- boruta==0.3
- catalogue==1.0.2
- certifi==2022.12.7
- charset-normalizer==3.0.1
- click==8.1.3
- cloudpickle==2.2.1
- colorama==0.4.6
- colorlover==0.3.0
- comm==0.1.2
- contourpy==1.0.7
- cufflinks==0.17.3
- cycler==0.11.0
- cymem==2.0.7
- cython==0.29.14
- databricks-cli==0.17.4
- debugpy==1.6.6
- decorator==5.1.1
- docker==6.0.1
- entrypoints==0.4
- executing==1.2.0
- flask==2.2.2
- fonttools==4.38.0
- funcy==1.18
- future==0.18.3
- gensim==3.8.3
- gitdb==4.0.10
- gitpython==3.1.30
- greenlet==2.0.2
- htmlmin==0.1.12
- idna==3.4
- imagehash==4.3.1
- imbalanced-learn==0.7.0
- importlib-metadata==5.2.0
- importlib-resources==5.10.2
- ipykernel==6.20.2
- ipython==8.9.0
- ipywidgets==8.0.4
- itsdangerous==2.1.2
- jedi==0.18.2
- jinja2==3.1.2
- joblib==1.2.0
- jupyter-client==8.0.1
- jupyter-core==5.1.5
- jupyterlab-widgets==3.0.5
- kiwisolver==1.4.4
- kmodes==0.12.2
- lightgbm==3.3.5
- llvmlite==0.37.0
- mako==1.2.4
- markdown==3.4.1
- markupsafe==2.1.2
- matplotlib==3.6.3
- matplotlib-inline==0.1.6
- mlflow==2.1.1
- mlxtend==0.19.0
- multimethod==1.9.1
- murmurhash==1.0.9
- nest-asyncio==1.5.6
- networkx==3.0
- nltk==3.8.1
- numba==0.54.1
- numexpr==2.8.4
- numpy==1.20.3
- oauthlib==3.2.2
- packaging==22.0
- pandas==1.5.3
- pandas-profiling==3.6.3
- parso==0.8.3
- patsy==0.5.3
- phik==0.12.3
- pickleshare==0.7.5
- pillow==9.4.0
- plac==1.1.3
- platformdirs==2.6.2
- plotly==5.13.0
- preshed==3.0.8
- prompt-toolkit==3.0.36
- protobuf==4.21.12
- psutil==5.9.4
- pure-eval==0.2.2
- pyarrow==10.0.1
- pycaret==2.3.10
- pydantic==1.10.4
- pygments==2.14.0
- pyjwt==2.6.0
- pyldavis==3.3.1
- pynndescent==0.5.8
- pyod==1.0.7
- pyparsing==3.0.9
- python-dateutil==2.8.2
- pytz==2022.7.1
- pywavelets==1.4.1
- pywin32==305
- pyyaml==5.4.1
- pyzmq==25.0.0
- querystring-parser==1.2.4
- regex==2022.10.31
- requests==2.28.2
- scikit-learn==0.23.2
- scikit-plot==0.3.7
- scipy==1.5.4
- seaborn==0.12.2
- shap==0.41.0
- six==1.16.0
- sklearn==0.0.post1
- slicer==0.0.7
- smart-open==6.3.0
- smmap==5.0.0
- spacy==2.3.9
- sqlalchemy==1.4.46
- sqlparse==0.4.3
- srsly==1.0.6
- stack-data==0.6.2
- statsmodels==0.13.5
- tabulate==0.9.0
- tangled-up-in-unicode==0.2.0
- tenacity==8.1.0
- textblob==0.17.1
- thinc==7.4.6
- threadpoolctl==3.1.0
- tornado==6.2
- tqdm==4.64.1
- traitlets==5.8.1
- typeguard==2.13.3
- typing-extensions==4.4.0
- umap-learn==0.5.3
- urllib3==1.26.14
- visions==0.7.5
- waitress==2.1.2
- wasabi==0.10.1
- wcwidth==0.2.6
- websocket-client==1.5.0
- werkzeug==2.2.2
- widgetsnbextension==4.0.5
- wordcloud==1.8.2.2
- yellowbrick==1.2.1
- zipp==3.12.0
prefix: C:\Users\username\anaconda3\envs\py38
This is probably an issue in the library: the HTML being written contains the Unicode character '\u25c4', which the default Windows encoding (cp1252) cannot represent.
Here is the pycaret source code referenced in the traceback:
def write_html(self, plot_filename):
    """
    Write the current plots to a file in HTML format.

    Parameters
    ----------
    plot_filename: str
        name of the file
    """
    html = self.get_html()
    with open(plot_filename, "w") as f:
        f.write(html)
As mentioned in this Stack Overflow question, it could be solved by specifying the encoding when opening the file:
with open(plot_filename, "w", encoding='utf-8') as f:
    f.write(html)
But since you cannot change the library's code, try running the following in the console before running your script, as mentioned in this answer:
chcp 65001
set PYTHONIOENCODING=utf-8
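To see why the explicit encoding matters, the failure can be reproduced without pycaret. This is only an illustrative sketch: the html string is a hypothetical stand-in for the generated plot page, and the file name residuals.html is made up.

```python
import os
import tempfile

# Hypothetical stand-in for the HTML that the plotting code generates.
html = "<div>\u25c4 previous</div>"

# cp1252 has no mapping for U+25C4, which is the error shown in the traceback.
try:
    html.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit UTF-8 encoding does not depend on the Windows code page.
path = os.path.join(tempfile.mkdtemp(), "residuals.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(cp1252_ok, round_trip == html)
```

Note that PYTHONIOENCODING only changes the encoding of stdin, stdout and stderr. If the error persists, setting the environment variable PYTHONUTF8=1 (Python 3.7 and later) makes open() default to UTF-8, which is the default that the library's open(plot_filename, "w") call relies on.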

String with newlines and special characters is not working as I expected

I created a string such as
str = "TZ=Europe/Berlin\n* * 1-5\n0 5 * * *\n"
I expected it to be read as
TZ=Europe/Berlin
* * 1-5
0 5 * * *
in the Jenkins cron configuration, but it did not work.
Are there any solutions?
Your cron expression doesn't appear to be valid. The validation error is "Day of month values must be between 1 and 31". Check it here: https://www.freeformatter.com/cron-expression-generator-quartz.html
Also check out https://plugins.jenkins.io/parameterized-scheduler/ for Jenkins-specific help.
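A quick local check can also narrow this down before pasting the string into Jenkins. The sketch below only counts fields, on the assumption that each schedule line should have the five fields of Jenkins cron syntax (minute, hour, day of month, month, day of week). The function name check_jenkins_schedule is hypothetical, and the check ignores aliases such as @daily.

```python
def check_jenkins_schedule(spec):
    """Return (line_number, text) for schedule lines that do not have five fields."""
    problems = []
    for number, line in enumerate(spec.splitlines(), start=1):
        text = line.strip()
        # Skip blank lines, comments and the TZ= header line.
        if not text or text.startswith("#") or text.startswith("TZ="):
            continue
        if len(text.split()) != 5:
            problems.append((number, text))
    return problems


schedule = "TZ=Europe/Berlin\n* * 1-5\n0 5 * * *\n"
print(check_jenkins_schedule(schedule))
```

With the string from the question, the second line (* * 1-5) is reported because it has three fields instead of five, which suggests the newlines themselves are handled as expected and the expression is the part to fix.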

Adding/changing numbers in specific lines

I have a big file with 250,000 lines, and I want to change a number that appears every 36 lines.
Example of my file:
rankup_cost_increase_percentage: 0.0
removepermission:
- essentials.warps.B
- essentials.warps.C
- essentials.warps.D
- essentials.warps.E
- essentials.warps.F
- essentials.warps.G
- essentials.warps.H
- essentials.warps.I
- essentials.warps.J
- essentials.warps.K
- essentials.warps.L
- essentials.warps.M
- essentials.warps.N
- essentials.warps.O
- essentials.warps.P
- essentials.warps.Q
- essentials.warps.R
- essentials.warps.S
- essentials.warps.T
- essentials.warps.U
- essentials.warps.W
- essentials.warps.X
- essentials.warps.Y
- essentials.warps.Z
executecmds:
- "[CONSOLE] crate give to %player% Legendary 1"
- "[CONSOLE] crate give to %player% Multiplier 1"
- "[player] warp A"
P7437:
nextprestige: P7438
cost: 3.7185E13
display: '&8[&9P7437&8]'
rankup_cost_increase_percentage: 0.0
I want the value of rankup_cost_increase_percentage: 0.0 to increase by 5.0 each time it appears.
How would I be able to do that?
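One way to approach this is to stream through the file and rewrite only the matching lines, keeping a running value. This is a minimal sketch under assumptions: the first occurrence stays at 0.0 and each later one is 5.0 higher than the previous, the key always sits alone on its line, and the helper name renumber is hypothetical.

```python
import re

# Captures the indentation and key so they can be written back unchanged.
KEY = re.compile(r"^(\s*rankup_cost_increase_percentage:\s*)\S+\s*$")


def renumber(lines, start=0.0, step=5.0):
    """Yield lines, replacing each matching value with start, start+step, ..."""
    value = start
    for line in lines:
        match = KEY.match(line)
        if match:
            yield f"{match.group(1)}{value}"
            value += step
        else:
            yield line


sample = [
    "rankup_cost_increase_percentage: 0.0",
    "removepermission:",
    "- essentials.warps.B",
    "  rankup_cost_increase_percentage: 0.0",
    "cost: 3.7185E13",
    "rankup_cost_increase_percentage: 0.0",
]
result = list(renumber(sample))
print("\n".join(result))
```

For the real file, the lines could be read with open(path, encoding="utf-8"), stripped of their trailing newline, passed through renumber, and written to a new file so the original stays untouched.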

Crontab every hour from 7 am to midnight

I need to run a script every hour from 7 am to midnight (including midnight).
I've created this crontab entry, but it didn't do anything at 7 and 8 am.
0 7-0/1 * * * ~/.venvs/p/bin/python ~/p/manage.py post_from_queue >> ~/p/logs/posts.log
Do you know where the problem is?
The proper format is:
0 0,7-23 * * * ~/.venvs/p/bin/python ~/p/manage.py post_from_queue >> ~/p/logs/posts.log
If this does not work in your case, you should set two cron entries:
0 7-23 * * * ~/.venvs/p/bin/python ~/p/manage.py post_from_queue >> ~/p/logs/posts.log
0 0 * * * ~/.venvs/p/bin/python ~/p/manage.py post_from_queue >> ~/p/logs/posts.log
It is wise to put the command above in a script and to use absolute paths instead of relative ones.
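To check which hours a given hour field covers, a small expander can help. This is a simplified sketch, not how any particular cron daemon parses fields: the function name expand_field is hypothetical, and rejecting a range whose start is greater than its end (such as 7-0) is an assumption, since implementations differ in how they treat it.

```python
def expand_field(field, low=0, high=23):
    """Expand a cron field such as "0,7-23" into a sorted list of values."""
    values = set()
    for part in field.split(","):
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            first, last = rng.split("-")
            start, end = int(first), int(last)
        else:
            start = end = int(rng)
        # Assumption: a reversed or out-of-bounds range is treated as an error.
        if not (low <= start <= end <= high):
            raise ValueError(f"invalid range: {part!r}")
        values.update(range(start, end + 1, step))
    return sorted(values)


print(expand_field("0,7-23"))
```

Under these assumptions, "0,7-23" expands to hour 0 plus hours 7 through 23, while "7-0/1" from the question raises ValueError because the range runs backwards.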

Run Cron at Minute 0 of Specific Hour Set

What I'm trying to do is run a Cron job at specific hours such as 1,9,13,16 but only once for each of those hours. Setting it up every few hours doesn't work for me because it needs to be at specific hours.
This is what I'm currently using but it doesn't run: 0 1,9,13,16 * * *
In order to get it to run I have to use this: 0 * * * * or * * * * *
Any ideas?
0 1,9,13,16 * * * is a perfectly valid cron expression (I've just checked it with jailshell, although I was confident). It seems to me you have a problem somewhere else. Try setting the cron job using crontab -e and make a quick test with * * * * * wget google.com to see if it works at all.
Also, here is an online cron job expression validator if you need it: http://www.unitedmindset.com/jonbcampos/2009/07/29/custom-validators-cron-job-expression-validator/
1 1,9,13,16 * * *
Try minute 1 instead of 0. It works for me.
However, 0 14 * * 1-5 works for me too, as does 0 1,13 * * *.
