spaCy sentence separation with dictionary source from OpenAI Whisper / WhisperX? - nlp

WhisperX is a Whisper extension that does an excellent job of speech-to-text transcription with per-word timestamps.
I'd like to use spaCy to split up the text strings into sensible clauses but maintain a connection to the source dictionary so the result can inform subtitles and other video editing tools.
Is there a pathway to do this in spaCy? Most of the examples I see expect a text string of input.
My source dictionaries from WhisperX will be many multiples of this type of thing:
"word-level": [
  {"text": "So", "start": 27.80031007751938, "end": 27.940671834625324},
  {"text": "we've", "start": 27.98077519379845, "end": 28.201343669250647},
  {"text": "got", "start": 28.301602067183463, "end": 28.502118863049095},
  {"text": "enough", "start": 28.58232558139535, "end": 28.842997416020673},
  {"text": "books.", "start": 28.983359173126615, "end": 29.223979328165374},
  {"text": "We", "start": 29.2640826873385, "end": 29.364341085271317},
  {"text": "hopefully", "start": 29.384392764857882, "end": 29.765374677002583},
  {"text": "enough", "start": 29.885684754521964, "end": 30.18645994832041},
  {"text": "to", "start": 30.607545219638244, "end": 30.72785529715762},
  {"text": "be", "start": 30.767958656330748, "end": 30.868217054263564},
  {"text": "able", "start": 30.96847545219638, "end": 31.128888888888888},
  {"text": "to", "start": 31.168992248062015, "end": 31.26925064599483},
  {"text": "share", "start": 31.349457364341085, "end": 31.590077519379843},
  {"text": "some", "start": 31.670284237726097, "end": 31.81064599483204},
  {"text": "of", "start": 31.850749354005167, "end": 31.890852713178294},
  {"text": "these", "start": 31.93095607235142, "end": 32.071317829457364},
  {"text": "books", "start": 32.15152454780362, "end": 32.41219638242894},
  {"text": "with", "start": 32.51245478036176, "end": 32.692919896640824},
  {"text": "the", "start": 32.71297157622739, "end": 32.79317829457364},
  {"text": "children", "start": 32.833281653746766, "end": 33.194211886304906},
  {"text": "our", "start": 33.254366925064595, "end": 33.37467700258398},
  {"text": "community", "start": 33.43483204134367, "end": 33.73560723514212},
  {"text": "at", "start": 33.81581395348837, "end": 33.89602067183462},
  {"text": "our", "start": 33.956175710594316, "end": 34.13664082687338},
  {"text": "annual", "start": 34.216847545219636, "end": 34.4574677002584},
  {"text": "Christmas", "start": 34.49757105943152, "end": 34.75824289405685},
  {"text": "on", "start": 34.79834625322997, "end": 34.838449612403096},
  {"text": "the", "start": 34.87855297157623, "end": 34.938708010335915},
  {"text": "Boulevard", "start": 34.95875968992248, "end": 35.25953488372093},
  {"text": "event", "start": 35.27958656330749, "end": 35.37984496124031}
]
I've played around with keying in and out of the dictionary, but this feels like a problem that already has a solution.

You can create docs from external tokenization like this:
import spacy
from spacy.tokens import Doc
nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=[entry["text"] for entry in entries])
If you need to, you can add additional metadata to the tokens for alignment:
from spacy.tokens import Token
Token.set_extension("my_metadata", default=None)
for i, entry in enumerate(entries):
    doc[i]._.my_metadata = entry["some_index"]
If you're using this input for training, you might want to add this flag to the doc, so that the pipeline components know that the spaces between the tokens in the default doc.text aren't significant:
doc.has_unknown_spaces = True
"We've" and "books." are not tokenized the same way by default by the English pipelines, so provided pipelines like en_core_web_sm won't do a particularly good job of analyzing this input, especially around sentence boundaries and tokens like "books.", where the tokenizer would expect ["books", "."].
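Putting the pieces above together, a minimal end-to-end sketch might look like the following. It uses a short excerpt of the WhisperX output from the question (timestamps rounded for readability) and marks sentence starts by hand, since the blank pipeline has no parser and the default sentencizer expects standalone punctuation tokens:

```python
import spacy
from spacy.tokens import Doc, Token

# A short excerpt of the WhisperX "word-level" entries from the question.
entries = [
    {"text": "So", "start": 27.800, "end": 27.941},
    {"text": "we've", "start": 27.981, "end": 28.201},
    {"text": "got", "start": 28.302, "end": 28.502},
    {"text": "enough", "start": 28.582, "end": 28.843},
    {"text": "books.", "start": 28.983, "end": 29.224},
    {"text": "We", "start": 29.264, "end": 29.364},
    {"text": "hopefully", "start": 29.384, "end": 29.765},
]

# Carry the timestamps on the tokens themselves.
Token.set_extension("start", default=None)
Token.set_extension("end", default=None)

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=[e["text"] for e in entries])

for i, token in enumerate(doc):
    token._.start = entries[i]["start"]
    token._.end = entries[i]["end"]
    # WhisperX attaches punctuation to the word ("books."), so the default
    # sentencizer (which looks for standalone punctuation tokens) won't
    # fire here; mark sentence starts by hand instead.
    token.is_sent_start = i == 0 or doc[i - 1].text.endswith((".", "!", "?"))

# Each sentence span now maps back to a (start, end) time range,
# ready to feed into a subtitle track.
clauses = [(sent.text, sent[0]._.start, sent[-1]._.end) for sent in doc.sents]
for text, start, end in clauses:
    print(f"{start:7.3f}-{end:7.3f}  {text}")
```

The crude `endswith` check is just a placeholder for whatever clause-splitting logic you prefer; the important part is that the token extensions survive alongside spaCy's own annotations, so any span you compute can be mapped back to the original timestamps.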

Related

How to 'sqlite3' run after build exe with electron-builder

I have built my Electron app with the help of https://medium.com/jspoint/packaging-and-distributing-electron-applications-using-electron-builder-311fc55178d9
It was a success (Windows only), but after installing the published app I get an error, as shown in the screenshot.
My scripts are as below:
package.json
"name": "aux-services",
"version": "1.0.0",
"description": "Mobile Repair Tracking System",
"main": "main.js",
"scripts": {
"start": "electron .",
"postinstall": "electron-builder install-app-deps",
"pack": "electron-builder -w"
},
"repository": {
"type": "git",
"url": "git+https://github.com/shafeequeot/Mobile-Service-Tracker.git"
},
"author": "AuxWall",
"email": "shafeequeot@gmail.com",
"url": "https://auxwall.com",
"license": "MIT",
"bugs": {
"url": "https://github.com/shafeequeot/Mobile-Service-Tracker/issues"
},
"homepage": "https://github.com/shafeequeot/Mobile-Service-Tracker#readme",
"devDependencies": {
"electron": "^11.1.1",
"electron-builder": "^22.14.13",
"sqlite3": "^5.0.2"
},
"dependencies": {
}
}
electron-builder.json
{
"appId": "com.auxWall.service",
"productName": "Aux Services",
"copyright": "AuxWall",
"directories": {
"app": ".",
"output": "out",
"buildResources": "build-res"
},
"files": [
"package.json",
"**/*",
"node_modules"
],
"dmg": {
"background": null,
"backgroundColor": "#ffffff",
"window": {
"width": "400",
"height": "300"
},
"contents": [
{
"x": 100,
"y": 100
},
{
"x": 300,
"y": 100,
"type": "link",
"path": "/Applications"
}
]
},
"mac": {
"target": "dmg",
"category": "public.auxWall.services"
},
"win": {
"target": "nsis"
},
"linux": {
"target": "AppImage",
"category": "Utility"
}
}
Can anybody help me resolve this issue?
If sqlite3 is required during normal operation of your Electron application, and not just during development, then you will need to add sqlite3 as a dependency.
I.e.: move "sqlite3": "^5.0.2" from "devDependencies": { ... } to "dependencies": { ... }.
package.json
{
"name": "aux-services",
"version": "1.0.0",
"description": "Mobile Repair Tracking System",
"main": "main.js",
"scripts": {
"start": "electron .",
"postinstall": "electron-builder install-app-deps",
"pack": "electron-builder -w"
},
"repository": {
"type": "git",
"url": "git+https://github.com/shafeequeot/Mobile-Service-Tracker.git"
},
"author": "AuxWall",
"email": "shafeequeot@gmail.com",
"url": "https://auxwall.com",
"license": "MIT",
"bugs": {
"url": "https://github.com/shafeequeot/Mobile-Service-Tracker/issues"
},
"homepage": "https://github.com/shafeequeot/Mobile-Service-Tracker#readme",
"devDependencies": {
"electron": "^11.1.1",
"electron-builder": "^22.14.13"
},
"dependencies": {
"sqlite3": "^5.0.2"
}
}

Jest coverage data not showing in report

I'm trying to get test results and code coverage data after running jest tests. The resulting file contains the results of the tests, but no coverage data. I'm using jest-junit as my reporter. Here's what's in my package.json:
"scripts": {
"test": "jest --verbose --silent --coverage --coverageDirectory=./",
"start": "node server.js",
"dev": "nodemon server.js"
},
"jest": {
"testEnvironment": "node",
"coveragePathIgnorePatterns": [
"/node_modules/"
],
"coverageReporters": [
"text",
"jest-junit"
],
"reporters": [
"default",
"jest-junit"
]
},
"jest-junit": {
"suiteName": "jest tests",
"outputDirectory": ".",
"outputName": "junit.xml",
"uniqueOutputName": "false",
"classNameTemplate": "{classname}-{title}",
"titleTemplate": "{classname}-{title}",
"ancestorSeparator": " › ",
"usePathForSuiteName": "true"
}

How to configure ESlint to work both Mac and WIndows machines

I have a Nodejs project which uses ESLint to keep consistency.
On my Mac machine I have no trouble and everything works, but on Windows I get this error:
No files matching the pattern "'./*'" were found.
Please check for typing mistakes in the pattern.
My setup for ESLint is
{
"env": {
"es6": true,
"node": true
},
"extends": [
"plugin:prettier/recommended",
"airbnb-base"
],
"plugins": [
"prettier"
],
"globals": {
"Atomics": "readonly",
"SharedArrayBuffer": "readonly"
},
"parserOptions": {
"ecmaVersion": 2018,
"sourceType": "module"
},
"rules": {
"prettier/prettier": "error",
"linebreak-style": "off"
}
}
Package.json
{
"name": "new-architecture-solution",
"version": "1.0.0",
"description": "",
"main": "server.js",
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1",
"prod": "node -r esm server.js",
"dev": "nodemon -r esm server.js",
"debug": "ndb nodemon -r esm server.js",
"lint": "eslint . --ext .js,.jsx --quiet",
"fix": "eslint './*' --fix",
"prettier": "prettier --write src/**/*.{js,css}"
},
"husky": {
"hooks": {
"pre-commit": "lint-staged"
}
},
"eslintIgnore": [
"package.json",
"package-lock.json",
"combined.log",
"swagger.json",
"README.md"
],
"lint-staged": {
"./**/*.{js,jsx,ts,tsx,json,css,scss,md}": [
"npm run prettier",
"npm run lint --color",
"npm run fix",
"git add"
]
},
I'm unable to find a solution, and I would like to have it work on both of my machines.
I haven't tried it yet on Windows, but according to this post replacing the single quotes with \" might do the trick. I've tried it on my Mac and it seems to work properly.
Edit: confirmed to work on Windows machines as well.
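With that substitution, the fix script in package.json would look something like this (escaped double quotes are understood by both POSIX shells and cmd.exe, unlike single quotes):

```json
"scripts": {
  "fix": "eslint \"./*\" --fix"
}
```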

How to watch multiple folders with nodemon?

I need something like this:
nodemon.json:
[{
"watch": ["src/api-gateway"],
"ext": "ts",
"ignore": ["src/**/*.spec.ts"],
"exec": "ts-node ./src/api-gateway/main.ts"
},
{
"watch": ["src/services/ping-service"],
"ext": "ts",
"ignore": ["src/**/*.spec.ts"],
"exec": "ts-node ./src/services/ping-service/ping-service.ts"
}]
Is that possible, or is there some alternative way to do it?
Use an array for the watch option:
"watch": [
  "folder1",
  "folder2"
]
You can look at the sample here https://github.com/remy/nodemon/blob/master/doc/sample-nodemon.md
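Note that nodemon.json holds a single configuration object, not an array, so a merged version of the two configs from the question might look like this (one exec per nodemon process; running a different command per folder would require two nodemon instances, e.g. with separate --config files):

```json
{
  "watch": ["src/api-gateway", "src/services/ping-service"],
  "ext": "ts",
  "ignore": ["src/**/*.spec.ts"],
  "exec": "ts-node ./src/api-gateway/main.ts"
}
```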

error missing script: nodemon

I was following a tutorial that I found here. Previously the command was running fine, but after some changes, when I run npm run nodemon it gives me an error.
Here is my package.json
{
"name": "smashing-react-i18n",
"version": "1.0.0",
"description": "",
"main": "dist/bundle.js",
"betterScripts": {
"build": {
"command": "webpack -p",
"env": {
"NODE_ENV": "production"
}
},
"nodemon": {
"command": "nodemon server.js",
"env": {
"NODE_PATH": "src"
}
}
},