How to reduce cognitive complexity in Python 3

I have a question regarding this piece of code:
doc = nlp(text)
words = nlp(text).ents[0]
for entity in doc.ents:
    self.entity_list = [entity]
left = [
    {'Left': str(words[entity.start - 1])}
    if words[entity.start - 1] and not words[entity.start - 1].is_punct
        and not words[entity.start - 1].is_space
    else {'Left': str(words[entity.start - 2])}
    if words[entity.start - 2] and not words[entity.start - 2].is_punct
        and not words[entity.start - 2].is_space
    else {'Left': str(words[entity.start - 3])}
    for entity in nlp(text).ents]
entities = [{'Entity': str(entity)} for entity in doc.ents]
right = [
    {'Right': str(words[entity.end])}
    if (entity.end < self.entity_list[-1].end) and not words[entity.end].is_punct
        and not words[entity.end].is_space
    else {'Right': str(words[entity.end + 1])}
    if (entity.end + 1 < self.entity_list[-1].end) and not words[entity.end + 1].is_punct
        and not words[entity.end + 1].is_space
    else {'Right': str(words[entity.end + 2])}
    if (entity.end + 2 < self.entity_list[-1].end) and not words[entity.end + 2].is_punct
        and not words[entity.end + 2].is_space
    else {'Right': 'null'}
    for entity in nlp(text).ents]
A few days ago I asked how to obtain the words on either side of an entity with spaCy in Python 3.
I found a solution and updated my question with the answer. However, it looks very complicated and ugly.
My question is:
How can I reduce the cognitive complexity here in order to get cleaner, more readable code?
Maybe with an iterator, or something Python 3 offers to control this kind of structure better?
I would appreciate any solution or suggestion.

You should move the index computations into dedicated functions and iterate instead of spelling out each offset by hand:
def get_left_index(entity, words):
    for i in range(1, 3):
        if (
            words[entity.start - i]
            and not words[entity.start - i].is_punct
            and not words[entity.start - i].is_space
        ):
            return entity.start - i
    # unconditional fallback to the third token back, as in the original
    return entity.start - 3


def get_right_index(entity, entity_list, words):
    for i in range(3):
        if (
            (entity.end + i < entity_list[-1].end)
            and not words[entity.end + i].is_punct
            and not words[entity.end + i].is_space
        ):
            return entity.end + i
    # implicitly returns None when no suitable token exists
left = [
    {"Left": str(words[get_left_index(entity, words)])} for entity in nlp(text).ents
]
entities = [{"Entity": str(entity)} for entity in doc.ents]
right = [
    {"Right": str(words[get_right_index(entity, self.entity_list, words)])}
    if get_right_index(entity, self.entity_list, words) is not None
    else {"Right": "null"}
    for entity in nlp(text).ents
]
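Going one step further, a single direction-aware helper can serve both lists. The following is only a sketch (the helper name first_clean_token is mine, not from the original code, and it indexes into doc itself rather than a separate words span); unlike the original, it bounds-checks against the document and falls back to 'null' instead of unconditionally taking the third token:

def first_clean_token(doc, start, step, limit=3):
    """Return the first token within `limit` steps of `start`, moving in
    direction `step`, that is neither punctuation nor whitespace."""
    for offset in range(1, limit + 1):
        index = start + step * offset
        if 0 <= index < len(doc):          # stay inside the document
            token = doc[index]
            if not token.is_punct and not token.is_space:
                return str(token)
    return 'null'                          # hypothetical fallback value

doc = nlp(text)
left = [{'Left': first_clean_token(doc, ent.start, -1)} for ent in doc.ents]
entities = [{'Entity': str(ent)} for ent in doc.ents]
right = [{'Right': first_clean_token(doc, ent.end - 1, +1)} for ent in doc.ents]

The bounds check also removes a pitfall of the original: words[entity.start - 1] silently wraps around to the end of the sequence when entity.start is 0, because of Python's negative indexing.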

Related

Longest Common Substring (using Intuition for Longest Common Subsequence)

I've been trying to learn dynamic programming, and I have come across two seemingly similar problems: "Longest Common Subsequence" and "Longest Common Substring".
Assume we have two strings, str1 and str2.
For Longest Common Subsequence, we fill the dp table as follows:

if str1[i] != str2[j]:
    dp[i][j] = max(dp[i-1][j], dp[i][j-1])
else:
    dp[i][j] = 1 + dp[i-1][j-1]
Following the same intuition, for "Longest Common Substring" can we do the following?

if str1[i] != str2[j]:
    dp[i][j] = max(dp[i-1][j], dp[i][j-1])
else:
    if str1[i-1] == str2[j-1]:
        dp[i][j] = 1 + dp[i-1][j-1]
    else:
        dp[i][j] = 1 + dp[i-1][j-1]

The check if str1[i-1] == str2[j-1] is meant to confirm that we are checking for substrings and not subsequences.
I didn't quite understand what you are asking, but I'll try to give a good explanation for the Longest Common Substring.
The subsequence intuition does not carry over directly, because a substring must be contiguous: the "skip this character" transitions max(dp[i][j-1], dp[i-1][j]) are precisely what makes the first recurrence count subsequences. For substrings, let dp[x][y] be the length of the longest common suffix of str1[0..x] and str2[0..y]. A matching pair of characters extends the common suffix ending just before it (this +1 exists because we have found a new matching character); a mismatch means no common substring ends at this pair of positions:

if str1[i] == str2[j]:
    dp[i][j] = dp[i-1][j-1] + 1
else:
    dp[i][j] = 0

The answer is the maximum value anywhere in the table, not dp[n][m], since the best substring can end at any pair of positions.
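As a runnable reference, here is a short Python version of the substring recurrence (a sketch added for illustration; the function name and test strings are mine), tracking the running maximum while the table is filled:

def longest_common_substring(str1, str2):
    """Length of the longest contiguous substring shared by str1 and str2."""
    n, m = len(str1), len(str2)
    # dp[i][j] = length of the longest common suffix of str1[:i] and str2[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
            # on a mismatch dp[i][j] stays 0: a substring cannot skip characters
    return best

assert longest_common_substring("abcdxyz", "xyzabcd") == 4  # "abcd"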

How to reduce memory use when searching for the kth character in a sequence?

Problem Statement:
A teacher once said: "Good writing is good writing is good writing."
Hence, the teacher defines f(0) = "Good writing is good writing is good writing."
To make the quote more interesting, the teacher defines f(n) = "Good writing is good " + f(n-1) + " writing is good " + f(n-1) + " is good writing." for all n ≥ 1.
For example, f(1) is:
Good writing is good Good writing is good writing is good writing. writing is good Good writing is good writing is good writing. is good writing.
Note that the quotation marks are not part of f(1).
The teacher wants to ask q questions. Each time she wants to find the k-th character of f(n). Characters are indexed starting at 1. If f(n) consists of fewer than k characters, output a period (".").
In all tests:
1 ≤ q ≤ 10
0 ≤ n ≤ 30
1 ≤ k ≤ 2^31 − 1
For example:
input:
3
0 4
1 100
1 1111111
output:
d
g
.
input:
3
0 6
1 13
1 22
output:
w
G
My problem:
What I've done so far is simply to create an array that stores the precomputed strings of f(0) to f(30).

String[] f = new String[31];
f[0] = "Good writing is good writing is good writing.";
f[1] = "Good writing is good Good writing is good writing is good writing. writing is good Good writing is good writing is good writing. is good writing.";
for (int i = 2; i < 31; i++) {
    f[i] = "Good writing is good " + f[i-1] + " writing is good " + f[i-1] + " is good writing.";
}
Once I've precomputed up to the maximum n of 30, I answer the queries:

while (q-- > 0) {
    int n = readInt();
    int k = readInt();
    if (f[n].length() < k) {
        System.out.println(".");
    } else {
        System.out.println(f[n].charAt(k - 1));
    }
}
Now the problem is that when I do this, I get an OutOfMemoryError. This led me to think that there must be a much faster and easier way to do this. I feel like there is a pattern, but I might be wrong. Any thoughts?
One idea is to determine in which of the concatenated sections the kth character falls. For that, we make the recursive call return the length of its result, in addition to the kth character, rather than the result itself; then we can decide which section the kth character falls in. The following code also returns the string, but just for purposes of demonstration.
You can see that in our function f(n, k), if the kth character falls inside one of the sections that are f(n - 1), we subtract the length of the preceding section(s) from k and obtain the kth character from f(n - 1, k - length_of_prefix), since it is just like looking for the (new) kth character inside f(n - 1). We perform that search recursively as needed.
JavaScript code:
// Returns [len_fn, kth, fn]
function f(n, k){
    const f_0 = "Good writing is good writing is good writing."
    const len_0 = f_0.length
    const str_1 = "Good writing is good "
    const str_2 = " writing is good "
    const str_3 = " is good writing."
    const len_1 = str_1.length
    const len_2 = str_2.length
    const len_3 = str_3.length

    if (n == 0)
        return [len_0, f_0[k-1], f_0]

    // k omitted here: only the length and the string of f(n - 1) are used
    const [len_fn1, kth1, fn1] = f(n - 1)
    const fn = str_1 + fn1 + str_2 + fn1 + str_3
    const len_fn = len_1 + len_fn1 + len_2 + len_fn1 + len_3
    const pos_fn1_2 = len_1 + len_fn1 + len_2

    if (k <= len_1)
        return [len_fn, str_1[k-1], fn]
    // kth is in the first copy of f(n - 1)
    else if (k <= len_1 + len_fn1){
        const kth = f(n - 1, k - len_1)[1]
        return [len_fn, kth, fn]
    }
    else if (k <= pos_fn1_2)
        return [len_fn, str_2[k-len_1-len_fn1-1], fn]
    // kth is in the second copy of f(n - 1)
    else if (k <= pos_fn1_2 + len_fn1){
        const kth = f(n - 1, k - pos_fn1_2)[1]
        return [len_fn, kth, fn]
    }
    else
        return [len_fn, str_3[k-pos_fn1_2-len_fn1-1], fn]
}

var result = f(2, 253)
console.log(JSON.stringify(result))
console.log("")
console.log(JSON.stringify(
    result[2].split("").map((x, i) => [i + 1, x])))
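If memory is the concern, note that we never need to build the strings at all: only the lengths of f(0)..f(30) are required, and they can be capped at 2^31 (since k ≤ 2^31 − 1) so the numbers stay small. A Python sketch of that variant (my own illustration, not the original poster's code; the names are hypothetical):

F0 = "Good writing is good writing is good writing."
S1, S2, S3 = "Good writing is good ", " writing is good ", " is good writing."
CAP = 2 ** 31  # k <= 2^31 - 1, so any length >= CAP behaves like infinity

# lengths[n] = len(f(n)), capped at CAP
lengths = [len(F0)]
for _ in range(30):
    lengths.append(min(CAP, len(S1) + 2 * lengths[-1] + len(S2) + len(S3)))

def kth_char(n, k):
    """Return the k-th (1-based) character of f(n), or '.' if f(n) is shorter."""
    if k > lengths[n]:
        return "."
    while n > 0:
        # f(n) = S1 + f(n-1) + S2 + f(n-1) + S3; locate the piece holding k
        for piece, size in ((S1, len(S1)), (None, lengths[n - 1]),
                            (S2, len(S2)), (None, lengths[n - 1])):
            if k <= size:
                if piece is not None:
                    return piece[k - 1]
                n -= 1              # descend into this copy of f(n-1)
                break
            k -= size               # skip past this piece
        else:                       # fell through all pieces: k is in S3
            return S3[k - 1]
    return F0[k - 1]

print(kth_char(0, 4), kth_char(1, 100), kth_char(1, 1111111))  # d g .

A capped length is never actually subtracted from k: whenever lengths[n - 1] sits at the cap, k is necessarily smaller than it, so the search descends into that copy instead.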

Multithreading in Scala

I was recently given a challenge in school: create a simple program in Scala that does some calculations on a matrix, using exactly 5 threads. Since I had no prior knowledge of Scala, I am stuck. I searched online but did not find how to create the exact number of threads I want. This is the code:
import scala.math

object Test {
  def main(args: Array[String]) {
    val M1: Seq[Seq[Int]] = List(
      List(1, 2, 3),
      List(4, 5, 6),
      List(7, 8, 9)
    )
    var tempData: Float = 0
    var count: Int = 1
    var finalData: Int = 0

    for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
      count = 1
      tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
      finalData = math.ceil(tempData / count).toInt
      printf("%d ", finalData)
    }

    def calc(i: Int, j: Int): Int = {
      if ((i < 0) || (j < 0) || (i > M1.length - 1))
        return 0
      else {
        count += 1
        return M1(i)(j)
      }
    }
  }
}
I tried this:

for (a <- 0 until 10) {
  val thread = new Thread {
    override def run {
      for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
        count = 1
        tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
        finalData = math.ceil(tempData / count).toInt
        printf("%d ", finalData)
      }
    }
  }
  thread.start
}

but it only executed the same thing 10 times.
Here's the original core of the calculation:

for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
  count = 1
  tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
  finalData = math.ceil(tempData / count).toInt
  printf("%d ", finalData)
}
Let's actually build a result array:

val R = Array.ofDim[Int](M1.length, M1(0).length)
var tempData: Float = 0
var count: Int = 1
var finalData: Int = 0

for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
  count = 1
  tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
  R(i)(j) = math.ceil(tempData / count).toInt
}
Now, that mutable count, modified in one function and referenced in another, is a bit of a code smell. Let's remove it: change calc to return an Option, assemble a list of the things to average, and flatten to keep only the Somes.

val R = Array.ofDim[Int](M1.length, M1(0).length)

for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
  val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
  R(i)(j) = math.ceil(tempList.sum.toDouble / tempList.length).toInt
}

def calc(i: Int, j: Int): Option[Int] = {
  if ((i < 0) || (j < 0) || (i > M1.length - 1))
    None
  else
    Some(M1(i)(j))
}
Next, a side-effecting for is a bit of a code smell too. So in the inner loop let's produce each row, and in the outer loop the list of rows:

val R = for (i <- 0 to M1.length - 1) yield {
  for (j <- 0 to M1(0).length - 1) yield {
    val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
    math.ceil(tempList.sum.toDouble / tempList.length).toInt
  }
}
Now, reading the Scala API, we notice ParSeq and Seq.par, so we'd like to work with map and friends. Let's de-sugar the for comprehensions:

val R = (0 until M1.length).map { i =>
  (0 until M1(0).length).map { j =>
    val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
    math.ceil(tempList.sum.toDouble / tempList.length).toInt
  }
}
This is our MotionBlurSingleThread. To make it parallel, we simply do:

val R = (0 until M1.length).par.map { i =>
  (0 until M1(0).length).par.map { j =>
    val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
    math.ceil(tempList.sum.toDouble / tempList.length).toInt
  }.seq
}.seq
And this is our MotionBlurMultiThread. It is nicely functional, too (no mutable values).
The limit of 5 or 10 threads isn't part of the challenge on GitHub, but if you need it, look into the degree of parallelism of Scala's parallel collections and related questions.
I am not an expert on Scala or on concurrency.
Scala's traditional approach to concurrency is through actors and messaging; you can read a little about that in Programming in Scala, chapter 30, "Actors and Concurrency" (the first edition is free, but it is outdated). In the latest version of Scala (2.12) the actors library is no longer included, and the recommendation is to use Akka instead.
So I would not recommend learning Scala, sbt, and Akka just for a challenge, but you can download an Akka quickstart and customize the example given to your needs; it is nicely explained there. Note that actors do not each get their own thread; they are scheduled over a shared pool. You can read about actors and threads in the Akka documentation, specifically the section about state.

Algorithms for finding similar/related texts

I searched a lot on Stack Overflow and Google but didn't find a good answer for this.
I'm going to develop a news reader system that crawls and collects news from the web, and I want to find similar or related news across websites (in order to prevent showing duplicate news).
I think the best live example of this is Google News: it collects news from around the web, then categorizes the articles and groups related ones. This is what I want to do.
What's the best algorithm for doing this?
A relatively simple solution is to compute a tf-idf vector (en.wikipedia.org/wiki/Tf*idf) for each document, then use the cosine distance (en.wikipedia.org/wiki/Cosine_similarity) between these vectors as an estimate for semantic distance between articles.
This will probably capture semantic relationships better than Levenshtein distance, and it is much faster to compute.
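As a concrete illustration (my own sketch; the answer does not prescribe a particular library), scikit-learn's TfidfVectorizer and cosine_similarity reduce this to a few lines:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: the first two "articles" report the same story
articles = [
    "Stocks rallied after the central bank held interest rates steady.",
    "Markets climbed as the central bank kept interest rates unchanged.",
    "A new species of frog was discovered in the rainforest.",
]

# One tf-idf vector per article, then all pairwise cosine similarities
vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)
similarity = cosine_similarity(vectors)
print(similarity.round(2))  # entry [0][1] comes out much higher than [0][2]

Pairs whose similarity exceeds a threshold tuned on your own corpus can then be grouped as related or duplicate stories.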
Here is one: http://en.wikipedia.org/wiki/Levenshtein_distance
public static SqlInt32 ComputeLevenshteinDistance(SqlString firstString, SqlString secondString)
{
    int n = firstString.Value.Length;
    int m = secondString.Value.Length;
    int[,] d = new int[n + 1, m + 1];

    // Step 1
    if (n == 0)
    {
        return m;
    }
    if (m == 0)
    {
        return n;
    }

    // Step 2
    for (int i = 0; i <= n; d[i, 0] = i++)
    {
    }
    for (int j = 0; j <= m; d[0, j] = j++)
    {
    }

    // Step 3
    for (int i = 1; i <= n; i++)
    {
        // Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            int cost = (secondString.Value[j - 1] == firstString.Value[i - 1]) ? 0 : 1;

            // Step 6
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
        }
    }

    // Step 7
    return d[n, m];
}
This is handy for the task at hand: http://code.google.com/p/boilerpipe/
Also, if you need to reduce the number of words to analyze, try this: http://ots.codeplex.com/
I have found OTS very useful in sentiment analysis, whereby I can reduce the sentences to a small list of common phrases and/or words and calculate the overall sentiment based on this. The same should work for similarity.

How to highlight the differences between subsequent lines in a file?

I do a lot of urgent analysis of large logfiles. Often this requires tailing a log and looking for changes.
I'm keen to have a solution that will highlight these changes, to make them easier for the eye to track.
I have investigated tools, and there doesn't appear to be anything out there that does what I'm looking for. I've written some Perl scripts that do it roughly, but I would like a more complete solution.
Can anyone recommend a tool for this?
Levenshtein distance
From Wikipedia: the Levenshtein distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
public static int LevenshteinDistance(char[] s1, char[] s2) {
    int s1p = s1.length, s2p = s2.length;
    int[][] num = new int[s1p + 1][s2p + 1];

    // fill arrays
    for (int i = 0; i <= s1p; i++)
        num[i][0] = i;
    for (int i = 0; i <= s2p; i++)
        num[0][i] = i;

    for (int i = 1; i <= s1p; i++)
        for (int j = 1; j <= s2p; j++)
            num[i][j] = Math.min(Math.min(num[i - 1][j] + 1, num[i][j - 1] + 1),
                    num[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1));

    return num[s1p][s2p];
}
Sample App in Java
String Diff
The application uses the LCS algorithm to merge 2 text inputs into 1. The result contains a minimal set of instructions to make one string from the other; below the instructions, the merged text is displayed.
Download application:
String Diff.jar
Download source:
Diff.java
I wrote a Python script for this purpose that utilizes difflib.SequenceMatcher:
#!/usr/bin/python3
from difflib import SequenceMatcher
from itertools import tee
from sys import stdin

def pairwise(iterable):
    """s -> (s0,s1), (s1,s2), (s2,s3), ...

    https://docs.python.org/3/library/itertools.html#itertools-recipes
    """
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def color(c, s):
    """Wrap string s in color c.

    Based on http://stackoverflow.com/a/287944/1916449
    """
    try:
        lookup = {'r': '\033[91m', 'g': '\033[92m', 'b': '\033[1m'}
        return lookup[c] + str(s) + '\033[0m'
    except KeyError:
        return s

def diff(a, b):
    """Yields paired and colored differences between a and b."""
    for tag, i, j, k, l in SequenceMatcher(None, a, b).get_opcodes():
        if tag == 'equal':
            yield 2 * [color('w', a[i:j])]
        if tag in ('delete', 'replace'):
            yield color('r', a[i:j]), ''
        if tag in ('insert', 'replace'):
            yield '', color('g', b[k:l])

if __name__ == '__main__':
    for a, b in pairwise(stdin):
        print(*map(''.join, zip(*diff(a, b))), sep='')
Example input.txt:
108 finished /tmp/ts-out.5KS8bq 0 435.63/429.00/6.29 ./eval.exe -z 30
107 finished /tmp/ts-out.z0tKmX 0 456.10/448.36/7.26 ./eval.exe -z 30
110 finished /tmp/ts-out.wrYCrk 0 0.00/0.00/0.00 tail -n 1
111 finished /tmp/ts-out.HALY18 0 460.65/456.02/4.47 ./eval.exe -z 30
112 finished /tmp/ts-out.6hdkH5 0 292.26/272.98/19.12 ./eval.exe -z 1000
113 finished /tmp/ts-out.eFBgoG 0 837.49/825.82/11.34 ./eval.exe -z 10
Output of cat input.txt | ./linediff.py: (colored terminal output not reproduced here).

http://neil.fraser.name/software/diff_match_patch/svn/trunk/demos/demo_diff.html
This looks promising; I'll update this answer with more info when I've played with it more.
