We have data like this:
input = {
    'a': 3,
    'b': {'g': {'l': 12}},
    'c': {
        'q': 3,
        'w': {'v': 3},
        'r': 8,
        'g': 4
    },
    'd': 4
}
It is not known in advance how many nesting levels there will be. We need to get the full path to each final value, with the parts joined by a dot (or another special character), like this:
a: 3
b.g.l: 12
c.q: 3
c.w.v: 3
etc.
I tried to solve this problem with a recursive function.
from typing import Optional

def recursive_parse(data: dict, cache: Optional[list] = None):
    if cache is None:
        cache = []
    for k in data:
        cache.append(k)
        if not isinstance(data[k], dict):
            print(f"{'.'.join(cache)} :{data[k]}")
            cache.clear()
        else:
            recursive_parse(data[k], cache)
But I have problems with "remembering" the previous key of the nested dictionary.
a :3
b.g.l :12
c.q :3
w.v :3
r :8
g :4
d :4
What is the correct algorithm to solve this?
It's probably better to use an explicit stack for this, rather than the Python call stack. Recursion is slow in Python, due to high function call overhead, and the recursion limit is fairly conservative.
def dotted(data):
    result = {}
    stack = list(data.items())
    while stack:
        k0, v0 = stack.pop()
        if isinstance(v0, dict):
            for k1, v1 in v0.items():
                item = ".".join([k0, k1]), v1
                stack.append(item)
        else:
            result[k0] = v0
    return result
Demo:
>>> data
{'a': 3,
'b': {'g': {'l': 12}},
'c': {'q': 3, 'w': {'v': 3}, 'r': 8, 'g': 4},
'd': 4}
>>> for k, v in reversed(dotted(data).items()):
... print(k, v)
...
a 3
b.g.l 12
c.q 3
c.w.v 3
c.r 8
c.g 4
d 4
Try:
dct = {
    "a": 3,
    "b": {"g": {"l": 12}},
    "c": {"q": 3, "w": {"v": 3}, "r": 8, "g": 4},
    "d": 4,
}

def parse(d, path=None):
    if path is None:
        path = []
    if isinstance(d, dict):
        for k, v in d.items():
            yield from parse(v, path + [k])
    else:
        yield "{}: {}".format(".".join(path), d)

for p in parse(dct):
    print(p)
Prints:
a: 3
b.g.l: 12
c.q: 3
c.w.v: 3
c.r: 8
c.g: 4
d: 4
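If a flat dict is wanted instead of printed strings, the same generator idea can yield (path, value) pairs; the `flat_items` function below is just a sketch of that variation (the name is made up here):

```python
def flat_items(d, path=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for k, v in d.items():
        key = f"{path}.{k}" if path else k
        if isinstance(v, dict):
            # descend, extending the dotted path
            yield from flat_items(v, key)
        else:
            yield key, v

dct = {'a': 3, 'b': {'g': {'l': 12}}, 'c': {'q': 3, 'w': {'v': 3}}, 'd': 4}
print(dict(flat_items(dct)))
# {'a': 3, 'b.g.l': 12, 'c.q': 3, 'c.w.v': 3, 'd': 4}
```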
I have a dataframe:
d1 = [({'the town': 1, 'County Council s': 2, 'email': 5}, 2),
      ({'Mayor': 2, 'Indiana': 2}, 4),
      ({'Congress': 2, 'Justice': 2, 'country': 2, 'veterans': 1}, 6)
]

df1 = spark.createDataFrame(d1, ['dct', 'count'])
df1.show()

ignore_lst = ['County Council s', 'emal', 'Indiana']
filter_lst = ['Congress', 'town', 'Mayor', 'Indiana']
I want to write two functions: the first keeps only the keys of the dct column that are not in ignore_lst, and the second keeps only the keys that are in filter_lst.
The result should be two new columns containing the dictionaries filtered by ignore_lst and filter_lst.
These two UDFs should be sufficient for your case:
from pyspark.sql.functions import col, udf

d1 = [({'the town': 1, 'County Council s': 2, 'email': 5}, 2),
      ({'Mayor': 2, 'Indiana': 2}, 4),
      ({'Congress': 2, 'Justice': 2, 'country': 2, 'veterans': 1}, 6)
]

ignore_lst = ['County Council s', 'emal', 'Indiana']
filter_lst = ['Congress', 'town', 'Mayor', 'Indiana']

df1 = spark.createDataFrame(d1, ['dct', 'count'])

@udf('map<string,int>')
def apply_ignore_lst(dct):
    return {k: v for k, v in dct.items() if k not in ignore_lst}

@udf('map<string,int>')
def apply_filter_lst(dct):
    return {k: v for k, v in dct.items() if k in filter_lst}

df1.withColumn("apply_ignore_lst", apply_ignore_lst(col("dct"))) \
   .withColumn("apply_filter_lst", apply_filter_lst(col("apply_ignore_lst"))) \
   .show(truncate=False)
+----------------------------------------------------------+-----+----------------------------------------------+----------------+
|dct |count|apply_ignore_lst |apply_filter_lst|
+----------------------------------------------------------+-----+----------------------------------------------+----------------+
|{the town -> 1, County Council s -> 2, email -> 5} |2 |{the town=1, email=5} |{} |
|{Indiana -> 2, Mayor -> 2} |4 |{Mayor=2} |{Mayor=2} |
|{Justice -> 2, Congress -> 2, country -> 2, veterans -> 1}|6 |{Congress=2, Justice=2, country=2, veterans=1}|{Congress=2} |
+----------------------------------------------------------+-----+----------------------------------------------+----------------+
It can be done in a one-liner using map_filter (available since Spark 3.1):
from pyspark.sql import functions as F

df1 \
    .withColumn("ignored", F.map_filter("dct", lambda k, _: ~k.isin(ignore_lst))) \
    .withColumn("filtered", F.map_filter("dct", lambda k, _: k.isin(filter_lst)))
Full example:
from pyspark.sql import functions as F

d1 = [({'the town': 1, 'County Council s': 2, 'email': 5}, 2),
      ({'Mayor': 2, 'Indiana': 2}, 4),
      ({'Congress': 2, 'Justice': 2, 'country': 2, 'veterans': 1}, 6)
]

df1 = spark.createDataFrame(d1, ['dct', 'count'])

ignore_lst = ['County Council s', 'emal', 'Indiana']
filter_lst = ['Congress', 'town', 'Mayor', 'Indiana']

df1 = df1 \
    .withColumn("ignored", F.map_filter("dct", lambda k, _: ~k.isin(ignore_lst))) \
    .withColumn("filtered", F.map_filter("dct", lambda k, _: k.isin(filter_lst)))
[Out]:
+----------------------------------------------------------+--------------------------+
|ignored |filtered |
+----------------------------------------------------------+--------------------------+
|{the town -> 1, email -> 5} |{} |
|{Mayor -> 2} |{Indiana -> 2, Mayor -> 2}|
|{Justice -> 2, Congress -> 2, country -> 2, veterans -> 1}|{Congress -> 2} |
+----------------------------------------------------------+--------------------------+
For example, I have two dictionaries with the same keys:
a = {"a": 1, "b": 2, "c": 4.5, "d": [1, 2], "e": "string", "f": {"f1": 0.0, "f2": 1.5}}
b = {"a": 10, "b": 20, "c": 3.5, "d": [0, 2, 4], "e": "q", "f": {"f1": 1.0, "f2": 0.0}}
and I want to compare the types. My code is something like this:
if type(a["a"]) == type(b["a"]) and type(a["b"]) == type(b["b"]) and type(a["c"]) == type(b["c"]) and type(a["d"]) == type(b["d"]) and type(a["e"]) == type(b["e"]) and type(a["f"]) == type(b["f"]) and type(a["f"]["f1"]) == type(b["f"]["f1"]) and type(a["f"]["f2"]) == type(b["f"]["f2"]):
    first_type = type(b["d"][0])
    if all(type(x) is first_type for x in a["d"]):
        # do something
        pass
Is there a better way to do it?
You can take the set of keys common to both dicts:
common_keys = a.keys() & b.keys()
and then iterate over them to check the types:
for k in common_keys:
    if type(a[k]) == type(b[k]):
        print("Yes, same type! " + k, a[k], b[k])
    else:
        print("Nope! " + k, a[k], b[k])
And if you want to go deeper, check whether any of the items are dicts, then rinse and repeat:
for k in common_keys:
    if type(a[k]) == type(b[k]):
        print("Yes, same type! " + k, type(a[k]), type(b[k]))
        if isinstance(a[k], dict):
            ck = a[k].keys() & b[k].keys()
            for key in ck:
                if type(a[k][key]) == type(b[k][key]):
                    print("Yes, same type! " + key, type(a[k][key]), type(b[k][key]))
                else:
                    print("Nope!")
    else:
        print("Nope! " + k, type(a[k]), type(b[k]))
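To avoid writing out one level of nesting by hand each time, the same check can be sketched as a recursive helper (`same_types` is a hypothetical name, not from the question):

```python
def same_types(a, b):
    """Return True if a and b have matching types, recursing into
    dicts (over their common keys) and lists (element by element)."""
    if type(a) is not type(b):
        return False
    if isinstance(a, dict):
        return all(same_types(a[k], b[k]) for k in a.keys() & b.keys())
    if isinstance(a, list):
        return all(same_types(x, y) for x, y in zip(a, b))
    return True

print(same_types({"f": {"f1": 0.0, "f2": 1.5}}, {"f": {"f1": 1.0, "f2": 0.0}}))  # True
print(same_types({"f": 1}, {"f": 1.0}))  # False
```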
You can use a for loop to iterate through the dicts:
same_types = True
for key in a.keys():
    if type(a[key]) != type(b[key]):
        same_types = False
        break
    # if the value is a dict, check nested value types
    if type(a[key]) == dict:
        for nest_key in a[key].keys():
            if type(a[key][nest_key]) != type(b[key][nest_key]):
                same_types = False
                break
    # if the value is a list, check all list elements
    # I simply concatenate the two lists; you can also refer to
    # https://stackoverflow.com/q/35554208/19322223
    elif type(a[key]) == list:
        first_type = type(a[key][0])
        for elem in a[key] + b[key]:
            if type(elem) != first_type:
                same_types = False
                break
    if not same_types:
        break

if same_types:
    # do something
    pass
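As a side note, when the dictionaries are flat and the keys match, the loop collapses to a single all() call; this sketch deliberately skips the nested dict and list handling above:

```python
x = {"a": 1, "b": 2.5, "c": "s"}
y = {"a": 7, "b": 0.1, "c": "t"}

# True only if every value under a shared key has the same type in both dicts
flat_same = all(type(x[k]) is type(y[k]) for k in x.keys() & y.keys())
print(flat_same)  # True
```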
With the following helper function:
def get_types(obj, items=None):
    """Recursively traverse 'obj' and return a list of the types
    of all values and nested values.
    """
    if items is None:  # 'if not items' would wrongly reset a shared empty list
        items = []
    if isinstance(obj, dict):
        for value in obj.values():
            if not isinstance(value, (dict, list, set, tuple)):
                items.append(value)
            else:
                get_types(value, items)
    elif isinstance(obj, (list, set, tuple)):
        for value in obj:
            get_types(value, items)
    else:
        items.append(obj)
    return [type(x) for x in items]
You can compare two dictionaries' value types, however deeply nested they are, like this:
if get_types(a) == get_types(b):
    print("Each a and b values are of same types")
Since, in your example, a has one value fewer under the d key ([1, 2]) than the other dict ([0, 2, 4]), nothing will be printed.
Let's take another example where both dictionaries have the same shape this time, but one value of a different type (the string "9" versus the int 9):
a = {"a": 1, "b": [[1, 2], [3, [4]]], "c": {"c1": 0.0, "c2": {"x": "9"}}}
b = {"d": 7, "e": [[2, 1], [5, [7]]], "f": {"f1": 8.9, "f2": {"y": 9}}}

if get_types(a) == get_types(b):
    print("Each a and b values are of same types")
Then again, nothing will be printed.
But if you replace 9 with "9" in b["f"]["f2"]:
a = {"a": 1, "b": [[1, 2], [3, [4]]], "c": {"c1": 0.0, "c2": {"x": "9"}}}
b = {"d": 7, "e": [[2, 1], [5, [7]]], "f": {"f1": 8.9, "f2": {"y": "9"}}}

if get_types(a) == get_types(b):
    print("Each a and b values are of same types")

# Output
# Each a and b values are of same types
I have 5 lists of words. I need to find all words occurring in more than 2 lists. Any word can occur multiple times in a list.
I have used collections.Counter but it only returns the frequencies of all the words in individual lists.
a = ['wood', 'tree', 'bark', 'log']
b = ['branch', 'mill', 'boat', 'boat', 'house']
c = ['log', 'tree', 'water', 'boat']
d = ['water', 'log', 'branch', 'water']
e = ['branch', 'rock', 'log']
For example, the output from these lists should be {'log': 4, 'branch': 3}, as 'log' is present in 4 lists and 'branch' in 3.
Without Counter:
a = ['wood', 'tree', 'bark', 'log']
b = ['branch', 'mill', 'boat', 'boat', 'house']
c = ['log', 'tree', 'water', 'boat']
d = ['water', 'log', 'branch', 'water']
e = ['branch', 'rock', 'log']

all_lists = [a, b, c, d, e]
all_words = set().union(w for l in all_lists for w in l)

out = {}
for word in all_words:
    s = sum(word in l for l in all_lists)
    if s > 2:
        out[word] = s

print(out)
Prints:
{'branch': 3, 'log': 4}
Edit (to print the names of lists):
a = ['wood', 'tree', 'bark', 'log']
b = ['branch', 'mill', 'boat', 'boat', 'house']
c = ['log', 'tree', 'water', 'boat']
d = ['water', 'log', 'branch', 'water']
e = ['branch', 'rock', 'log']

all_lists = {'a': a, 'b': b, 'c': c, 'd': d, 'e': e}
all_words = set().union(w for l in all_lists.values() for w in l)

out = {}
for word in all_words:
    s = sum(word in l for l in all_lists.values())
    if s > 2:
        out[word] = s

for k, v in out.items():
    print('Word : {}'.format(k))
    print('Count: {}'.format(v))
    print('Lists: {}'.format(', '.join(kk for kk, vv in all_lists.items() if k in vv)))
    print()
Prints:
Word : log
Count: 4
Lists: a, c, d, e
Word : branch
Count: 3
Lists: b, d, e
You can sum the counters, starting with an empty Counter():
from collections import Counter
lists = [a, b, c, d, e]
total = sum((Counter(set(lst)) for lst in lists), Counter())
# Counter({'log': 4, 'branch': 3, 'tree': 2, 'boat': 2, 'water': 2,
# 'wood': 1, 'bark': 1, 'house': 1, 'mill': 1, 'rock': 1})
res = {word: occ for word, occ in total.items() if occ > 2}
# {'log': 4, 'branch': 3}
Note that each list is converted to a set first, to avoid double-counting words that appear more than once in the same list.
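To see why the set conversion matters, compare the counts with and without it on the first three lists, where 'boat' appears twice in b:

```python
from collections import Counter

a = ['wood', 'tree', 'bark', 'log']
b = ['branch', 'mill', 'boat', 'boat', 'house']
c = ['log', 'tree', 'water', 'boat']

# with sets: each list contributes at most 1 per word
with_sets = sum((Counter(set(lst)) for lst in [a, b, c]), Counter())
# without sets: duplicates inside one list are counted individually
without = sum((Counter(lst) for lst in [a, b, c]), Counter())

print(with_sets['boat'])  # 2 -- number of lists containing 'boat'
print(without['boat'])    # 3 -- total occurrences across all lists
```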
If you need to know which lists the words came from, you could try this:
lists = {"a": a, "b": b, "c": c, "d": d, "e": e}
total = sum((Counter(set(lst)) for lst in lists.values()), Counter())
# Counter({'log': 4, 'branch': 3, 'tree': 2, 'boat': 2, 'water': 2,
# 'wood': 1, 'bark': 1, 'house': 1, 'mill': 1, 'rock': 1})
res = {word: occ for word, occ in total.items() if occ > 2}
# {'log': 4, 'branch': 3}
word_appears_in = {
    word: [key for key, value in lists.items() if word in value] for word in res
}
# {'log': ['a', 'c', 'd', 'e'], 'branch': ['b', 'd', 'e']}
x = {"a": 1, "b": 2, "c": 3}
y = {"b": 4, "c": 5, "d": 6}
for key in x:
    if key in y:
        a = x[key]
        b = y[key]
This gives a as 2 then 3, and b as 4 then 5. What I'm trying to do is multiply the matching key values together and then add those products up. I am not quite sure how to do this. If you guys could help me out, that would be great. Thank you in advance.
A simple way to do this would be to just keep a running total, e.g.:
total = 0
for key in x:
    if key in y:
        a = x[key]
        b = y[key]
        total += a * b

print(total)  # 23
But Python has powerful comprehensions and generators that can simplify this to:
>>> sum(x[key]*y[key] for key in x if key in y)
23
You can use sum with a generator:
x = {"a": 1, "b": 2, "c": 3}
y = {"b": 4, "c": 5, "d": 6}
sum(x[k] * y[k] for k in set(x) & set(y))
# 23
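A small variation on the above: since dict views already support set operations, the explicit set() conversions aren't strictly needed; x.keys() & y.keys() works directly:

```python
x = {"a": 1, "b": 2, "c": 3}
y = {"b": 4, "c": 5, "d": 6}

# dict key views support &, so no set() wrapping is required
print(sum(x[k] * y[k] for k in x.keys() & y.keys()))  # 23
```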