I'd like to use deep_merge option knockout_prefix to remove entries from a hiera array within puppet.
# upper hierarchy
---
foo:
- a
- b
- c
# lower hierarchy
---
foo:
- '--b'
- y
- z
# expected result
foo => [a,c,y,z]
I'm using current puppet 4.x PC1 installation. Hiera hierarchy configuration according to official documentation is
[...] my hierarchy stuff omitted
:merge_behavior: deeper
:deep_merge_options:
:knockout_prefix: '--'
My system should be configured correctly to use this feature:
$ hiera -v
3.0.6
$ /opt/puppetlabs/puppet/bin/gem list --local
*** LOCAL GEMS ***
activemodel (4.2.5)
activesupport (4.2.5)
bigdecimal (1.2.4)
builder (3.2.2)
bundler (1.10.6)
deep_merge (1.0.1)
facter (3.1.4)
[...]
What's my mistake here?
knockout_prefix is used to knockout specific keys and not values in an array, as you have. using your code as an example this would look like
# upper hierarchy
---
lookup_options:
foo:
merge:
strategy: deep
knockout_prefix: '--'
foo:
a: a
b: b
c: c
# lower hierarchy
---
foo:
b: --
y: y
z: z
# expected result
foo = { 'a' => 'a', 'c' => 'c', 'y' => 'y', 'z' => 'z' }
Hiera 3 merge behavior applies to hash merge lookups. The values associated with key 'foo' in your data are arrays, not hashes, so hash-merging does not apply to them. If you attempt a hash merge lookup on them (i.e. $result = hiera_hash('foo')) then Hiera should throw an error.
If you instead perform an array merge lookup (i.e. $result = hiera_array('foo')) then your knockout prefix is irrelevant. In this case, hiera forms an array of the values of the specified key from every hierarchy level (these are expected to be arrays or strings), flattens it, and removes duplicates. On your data, the result of this should be a six-element array, ['a', 'b', 'c', '--b', 'y', 'z'].
And of course there is the ordinary priority lookup, which you get from automated data binding or from explicitly calling the hiera('foo'). Supposing that "upper hierarchy" means the one with higher priority, the result of that lookup would be ['a', 'b', 'c'].
I know I'm coming in a couple of years and puppet versions late, but I recently got it in my head to try exactly this and couldn't find a solution elsewhere. Note that it is not a direct equivalent of a knockout: it's a puppet runtime manipulation of the list, and it doesn't obey the hierarchical order of the entries. That is, if you "knockout" a value at the lowest-weighted hiera entry, it will still be "knocked out" further up the chain.
Here is the code in my puppet class, assuming the foo hiera from the question:
# Make an array of all of the items that start with `--`
# (i.e., those that will be knocked out)
$_foo_knockout = $foo.filter |$item| { $item =~ /^--/ }
# Make a new array of items that:
# * don't directly match an item in the knockout list
# * don't match an item in the knockout list when prefixed with `--`
$_foo_filtered = $foo.filter |$item| {
!($item in $_foo_knockout or "--${ex}" in $_foo_knockout)
}
With the example hiera, this would result in
$foo => ['a','b','c','--b','y','z']
$_foo_filtered => ['a','c','y','z']
Notes:
Tested with hiera = 3.5.0 and puppet = 6.7.2
Array is unique merged
Reiterating that an item can be knocked out from any level of the hierarchy
Related
When using ruamel.yaml version 0.15.92 with Python 3.6.6 on CentOS 7, I cannot seem to update the value of an anchored scalar in a sequence without destroying the anchor itself or creating invalid YAML from the next dump.
I have attempted to recreate the original node type with the new value (old PlainScalarString -> new PlainScalarString, old FoldedScalarString -> new FoldedScalarString, etc), copying the anchor to it. While this restores the anchor to the updated scalar value, it also creates invalid YAML because the first alias later in the YAML file duplicates the same anchor name and assigns to it the old value of the scalar I'm trying to update.
I then attempted to replace all of the affected aliases with actual alias text -- like *anchor_name -- but that causes the value to become quoted like '*anchor_name', rendering the alias useless.
I reverted that and then attempted to suppress the duplicate anchor name (by setting always_dump=False on every affected alias). While that does suppress the duplicate anchor name, it unfortunately just dumps the old value of the anchored scalar.
My entire test data is as follows; assume this is named test.yaml:
# Header comment
---
# Post-header comment
# Reusable aliases
aliases:
- &plain_value This is unencrypted
- &string_password ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEwDQYJKoZIhvcNAQEBBQAEggEAYnFbMveZGBgd9aw7h4VV+M202zRdcP96UQs1q+ViznJK2Ee08hoW9jdIqVhNaecYALUihKjVYijJa649VF7BLZXV0svLEHD8LZeduoLS3iC9uszdhDFB2Q6R/Vv/ARjHNoWc6/D0nFN9vwcrQNITnvREl0WXYpR9SmW0krUpyr90gSAxTxPNJVlEOtA0afeJiXOtQEu/b8n+UDM3eXXRO+2SEXM4ub7fNcj6V9DgT3WwKBUjqzQ5DicnB19FNQ1cBGcmCo8qRv0JtbVqZ4+WJFGc06hOTcAJPsAaWWUn80ChcTnl4ELNzpJFoxAxHgepirskuIvuWZv3h/PL8Ez3NDBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBBSuVIsvWXMmdFJtJmtJxXxgCAGFCioe/zdphGqynmj6vVDnCjA3Xc0VPOCmmCl/cTKdg==]
- &block_password >
ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEw
DQYJKoZIhvcNAQEBBQAEggEAojErrxuNcdX6oR+VA/I3PyuV2CwXx166nIUp
asEHo1/CiCIoE3qCnjK2FJF8vg+l3AqRmdb7vYrqQ+30RFfHSlB9zApSw8NW
tnEpawX4hhKAxnTc/JKStLLu2k7iZkhkor/UA2HeVJcCzEeYAwuOQRPaolmQ
TGHjvm2w6lhFDKFkmETD/tq4gQNcOgLmJ+Pqhogr/5FmGOpJ7VGjpeUwLteM
er3oQozp4l2bUTJ8wk9xY6cN+eeOIcWXCPPdNetoKcVropiwrYH8QV4CZ2Ky
u0vpiybEuBCKhr1EpfqhrtuG5s817eOb7+Wf5ctR0rPuxlTUqdnDY31zZ3Kb
mcjqHDBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBBATq6BjaxU2bfcLL5S
bxzsgCDsWzggzxsCw4Dp0uYLwvMKjJEpMLeFXGrLHJzTF6U2Nw==]
top_key: unencrypted value
top_alias: *plain_value
top::hash:
ignore: more
# This pulls its string-form value from above
stringified_alias: *string_password
sub:
ignore: value
key: unencrypted subbed-value
# This pulls its block-form value from above
blocked_alias: *block_password
sub_more:
# This is a stringified EYAML value, NOT an alias
inline_string: ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEwDQYJKoZIhvcNAQEBBQAEggEAafmyrrae2kx8HdyPmn/RHQRcTPhqpx5Idm12hCDCIbwVM++H+c620z4EN2wlugz/GcLaiGsybaVWzAZ+3r+1+EwXn5ec4dJ5TTqo7oxThwUMa+SHliipDJwGoGii/H+y2I+3+irhDYmACL2nyJ4dv4IUXwqkv6nh1J9MwcOkGES2SKiDm/WwfkbPIZc3ccp1FI9AX/m3SVqEcvsrAfw6HtkolM22csfuJREHkTp7nBapDvOkWn4plzfOw9VhPKhq1x9DUCVFqqG/HAKv++v4osClK6k1MmSJWaMHrW1z3n7LftV9ZZ60E0Cgro2xSaD+itRwBp07H0GeWuoKB4+44TBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBCRv9r2lvQ1GJMoD064EtdigCCw43EAKZWOc41yEjknjRaWDm1VUug6I90lxCsUrxoaMA==]
# Also NOT an alias, in block form
block_string: >
ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEw
DQYJKoZIhvcNAQEBBQAEggEAafmyrrae2kx8HdyPmn/RHQRcTPhqpx5Idm12
hCDCIbwVM++H+c620z4EN2wlugz/GcLaiGsybaVWzAZ+3r+1+EwXn5ec4dJ5
TTqo7oxThwUMa+SHliipDJwGoGii/H+y2I+3+irhDYmACL2nyJ4dv4IUXwqk
v6nh1J9MwcOkGES2SKiDm/WwfkbPIZc3ccp1FI9AX/m3SVqEcvsrAfw6Htko
lM22csfuJREHkTp7nBapDvOkWn4plzfOw9VhPKhq1x9DUCVFqqG/HAKv++v4
osClK6k1MmSJWaMHrW1z3n7LftV9ZZ60E0Cgro2xSaD+itRwBp07H0GeWuoK
B4+44TBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBCRv9r2lvQ1GJMoD064
EtdigCCw43EAKZWOc41yEjknjRaWDm1VUug6I90lxCsUrxoaMA==]
# Signature line
There are two forms of this issue, so here are two code examples for reproducing the conditions:
First, "How can we most simply update the value of an anchored scalar in a sequence without destroying the anchor or its aliases?" This looks like:
with open('test.yaml', 'r') as f:
yaml_data = yaml.load(f)
yaml_data['aliases'][1] = "New string password"
yaml.dump(yaml_data, sys.stdout)
Note that this destroys the anchor. I would very much prefer the solution look as similar to this first snippet as possible; perhaps something like yaml_data['aliases'][1].set_value("New string password") # Changes only the scalar value while preserving the original anchor, comments, position, et al..
Second, "If we must instead wrap the new value in some object to preserve the anchor (and other attributes of the entry being replaced), what is the simplest approach which also preserves all aliases that refer to it (such that they adopt the updated value) when dumped?" My attempt to solve this requires quite a lot more code including recursive functions. Since SO guidelines advise against dumping large code, I will offer the relevant bits. Please assume the unlisted code is working perfectly well.
### <snip def FindEYAMLPaths(...) returns lists of paths through the YAML to every value starting with 'ENC['>
### <snip def GetYAMLValue(...) returns the node -- as a PlainScalarString, FoldedScalarString, et al. -- identified by a path from FindEYAMLPaths>
### <snip def DisableAnchorDump(...) sets `anchor.always_dump=False` if the node has an anchor attribute>
def ReplaceYAMLValue(value, data, path=None):
if path is None:
return
ref = data
last_ref = path.pop()
for p in path:
ref = ref[p]
# All I'm trying to do here is change the scalar value without disrupting its comments, anchor, positioning, or any of its aliases.
# This succeeds in changing the scalar value and preserving its original anchor, but disrupts its aliases which insist on preserving the old value.
if isinstance(ref[last_ref], PlainScalarString):
ref[last_ref] = PlainScalarString(value, anchor=ref[last_ref].anchor.value)
elif isinstance(ref[last_ref], FoldedScalarString):
ref[last_ref] = FoldedScalarString(value, anchor=ref[last_ref].anchor.value)
else:
ref[last_ref] = value
with open('test.yaml', 'r') as f:
yaml_data = yaml.load(f)
seen_anchors = []
for path in FindEYAMLPaths(yaml_data):
if path is None:
continue
node = GetYAMLValue(yaml_data, deque(path))
if hasattr(node, 'anchor'):
test_anchor = node.anchor.value
if test_anchor is not None:
if test_anchor in seen_anchors:
# This is expected to just be an alias, pointing at the newly updated anchor
DisableAnchorDump(node)
continue
seen_anchors.append(test_anchor)
ReplaceYAMLValue("New string password", yaml_data, path)
yaml.dump(yaml_data, sys.stdout)
Note that this produces valid YAML except that all of the affected aliases are gone, replaced instead by the old value of the anchored scalar.
I expect to be able to change the value of an aliased scalar in a sequence without disrupting any other part of the YAML content. Based on other posts I've seen about ruamel.yaml, I fully accept that I may need to dump the updated YAML to file and reload it for the in-memory aliases to update to the new value. I simply expect to change:
Input File
aliases:
- &some_anchor Old value
usage: *some_anchor
to:
Output File
aliases:
- &some_anchor NEW VALUE
usage: *some_anchor
Instead, here's the output from the above two examples:
First, notice that the original anchor was destroyed and the value for top::hash:stringified_alias: now carries the original anchor and old value instead of the alias to the newly updated scalar value at ['aliases'][1]:
---
# Post-header comment
# Reusable aliases
aliases:
- &plain_value This is unencrypted
- New string password
- &block_password >
ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEw
DQYJKoZIhvcNAQEBBQAEggEAojErrxuNcdX6oR+VA/I3PyuV2CwXx166nIUp
asEHo1/CiCIoE3qCnjK2FJF8vg+l3AqRmdb7vYrqQ+30RFfHSlB9zApSw8NW
tnEpawX4hhKAxnTc/JKStLLu2k7iZkhkor/UA2HeVJcCzEeYAwuOQRPaolmQ
TGHjvm2w6lhFDKFkmETD/tq4gQNcOgLmJ+Pqhogr/5FmGOpJ7VGjpeUwLteM
er3oQozp4l2bUTJ8wk9xY6cN+eeOIcWXCPPdNetoKcVropiwrYH8QV4CZ2Ky
u0vpiybEuBCKhr1EpfqhrtuG5s817eOb7+Wf5ctR0rPuxlTUqdnDY31zZ3Kb
mcjqHDBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBBATq6BjaxU2bfcLL5S
bxzsgCDsWzggzxsCw4Dp0uYLwvMKjJEpMLeFXGrLHJzTF6U2Nw==]
# ... snip ...
top::hash:
ignore: more
# This pulls its string-form value from above
stringified_alias: &string_password ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEwDQYJKoZIhvcNAQEBBQAEggEAYnFbMveZGBgd9aw7h4VV+M202zRdcP96UQs1q+ViznJK2Ee08hoW9jdIqVhNaecYALUihKjVYijJa649VF7BLZXV0svLEHD8LZeduoLS3iC9uszdhDFB2Q6R/Vv/ARjHNoWc6/D0nFN9vwcrQNITnvREl0WXYpR9SmW0krUpyr90gSAxTxPNJVlEOtA0afeJiXOtQEu/b8n+UDM3eXXRO+2SEXM4ub7fNcj6V9DgT3WwKBUjqzQ5DicnB19FNQ1cBGcmCo8qRv0JtbVqZ4+WJFGc06hOTcAJPsAaWWUn80ChcTnl4ELNzpJFoxAxHgepirskuIvuWZv3h/PL8Ez3NDBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBBSuVIsvWXMmdFJtJmtJxXxgCAGFCioe/zdphGqynmj6vVDnCjA3Xc0VPOCmmCl/cTKdg==]
# ... snip ...
Second, notice that ['aliases'][1] now looks correct -- it is the new value with the original anchor -- but where I expect to see aliases to it, I instead see the old value. I expect to see *string_password instead of ENC[...].
---
# Post-header comment
# Reusable aliases
aliases:
- &plain_value This is unencrypted
- &string_password New string password
- &block_password >-
New string password
# ... snip ...
top::hash:
ignore: more
# This pulls its string-form value from above
stringified_alias: ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEwDQYJKoZIhvcNAQEBBQAEggEAYnFbMveZGBgd9aw7h4VV+M202zRdcP96UQs1q+ViznJK2Ee08hoW9jdIqVhNaecYALUihKjVYijJa649VF7BLZXV0svLEHD8LZeduoLS3iC9uszdhDFB2Q6R/Vv/ARjHNoWc6/D0nFN9vwcrQNITnvREl0WXYpR9SmW0krUpyr90gSAxTxPNJVlEOtA0afeJiXOtQEu/b8n+UDM3eXXRO+2SEXM4ub7fNcj6V9DgT3WwKBUjqzQ5DicnB19FNQ1cBGcmCo8qRv0JtbVqZ4+WJFGc06hOTcAJPsAaWWUn80ChcTnl4ELNzpJFoxAxHgepirskuIvuWZv3h/PL8Ez3NDBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBBSuVIsvWXMmdFJtJmtJxXxgCAGFCioe/zdphGqynmj6vVDnCjA3Xc0VPOCmmCl/cTKdg==]
# ... snip ...
If you read in an anchored scalar, like your This is unencrypted,
using ruamel.yaml, you get a PlainScalarString object (or one of the other ScalarString
subclasses), which is an extremely thin layer around the basic string
type. That layer has an attribute to store an anchor if applicable (other uses are primarily to
maintain quoting/literal/folding style information). And any aliases using that anchor refer to the same ScalarString instance.
When dumping the anchor attribute is not used to create aliases, that
is is done in the normal way by having multiple references to the same
object. The attribute is only used to write the anchor id and also
does so if there is an attribute but no further references (i.e. an anchor without aliases).
So it is not surprising that if you replace such an object with
multiple references (either at the anchor spot or any of the alias
spots) that the reference disappears. If you then also force the same
anchor name on some other object, you get duplicate anchors, contrary
to the normal anchor/alias generation there is no check done on
"forced" anchors.
Since the ScalarString is such a thin wrapper, they are essentially
immutable objects, just like the string itself. Unlike with aliased
dicts and lists which are collection objects that can be emptied and
then filled (instead of replaced by a new instance), you cannot do
that with string.
The implementation of ScalarString can of course be changed, so you
can have your set_values() method, but involves creating alternative
classes for all the objects (PlainScalarString,
FoldedScalarString). You would have to make sure
these get used for constructing and for representing and then
preferable also behave like normal strings as far as you need it, so
at least you can print.
That is relatively easy to do but requires copying and slightly modifyging several
tens of lines of code
I think it is easier to leave the ScalarStrings in place as is (i.e
being immutable) and do what you need to do if you want to change all
occurences (i.e. references): update all the references to the
original. If your datastructure would contain millions of nodes that
might be prohibitively time consuming, but still would be afraction of what
loading and dumping the YAML itself would take:
import sys
from pathlib import Path
import ruamel.yaml
in_file = Path('test.yaml')
def update_aliased_scalar(data, obj, val):
def recurse(d, ref, nv):
if isinstance(d, dict):
for i, k in [(idx, key) for idx, key in enumerate(d.keys()) if key is ref]:
d.insert(i, nv, d.pop(k))
for k, v in d.non_merged_items():
if v is ref:
d[k] = nv
else:
recurse(v, ref, nv)
elif isinstance(d, list):
for idx, item in enumerate(d):
if item is ref:
d[idx] = nv
else:
recurse(item, ref, nv)
if hasattr(obj, 'anchor'):
recurse(data, obj, type(obj)(val, anchor=obj.anchor.value))
else:
recurse(data, obj, type(obj)(val))
yaml = ruamel.yaml.YAML()
yaml.indent(mapping=2, sequence=4, offset=2)
yaml.preserve_quotes = True
data = yaml.load(in_file)
update_aliased_scalar(data, data['aliases'][1], "New string password")
update_aliased_scalar(data, data['top::hash']['sub']['blocked_alias'], "New block password\n")
yaml.dump(data, sys.stdout)
which gives:
# Post-header comment
# Reusable aliases
aliases:
- &plain_value This is unencrypted
- &string_password New string password
- &block_password >
New block password
top_key: unencrypted value
top_alias: *plain_value
top::hash:
ignore: more
# This pulls its string-form value from above
stringified_alias: *string_password
sub:
ignore: value
key: unencrypted subbed-value
# This pulls its block-form value from above
blocked_alias: *block_password
sub_more:
# This is a stringified EYAML value, NOT an alias
inline_string: ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEwDQYJKoZIhvcNAQEBBQAEggEAafmyrrae2kx8HdyPmn/RHQRcTPhqpx5Idm12hCDCIbwVM++H+c620z4EN2wlugz/GcLaiGsybaVWzAZ+3r+1+EwXn5ec4dJ5TTqo7oxThwUMa+SHliipDJwGoGii/H+y2I+3+irhDYmACL2nyJ4dv4IUXwqkv6nh1J9MwcOkGES2SKiDm/WwfkbPIZc3ccp1FI9AX/m3SVqEcvsrAfw6HtkolM22csfuJREHkTp7nBapDvOkWn4plzfOw9VhPKhq1x9DUCVFqqG/HAKv++v4osClK6k1MmSJWaMHrW1z3n7LftV9ZZ60E0Cgro2xSaD+itRwBp07H0GeWuoKB4+44TBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBCRv9r2lvQ1GJMoD064EtdigCCw43EAKZWOc41yEjknjRaWDm1VUug6I90lxCsUrxoaMA==]
# Also NOT an alias, in block form
block_string: >
ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEw
DQYJKoZIhvcNAQEBBQAEggEAafmyrrae2kx8HdyPmn/RHQRcTPhqpx5Idm12
hCDCIbwVM++H+c620z4EN2wlugz/GcLaiGsybaVWzAZ+3r+1+EwXn5ec4dJ5
TTqo7oxThwUMa+SHliipDJwGoGii/H+y2I+3+irhDYmACL2nyJ4dv4IUXwqk
v6nh1J9MwcOkGES2SKiDm/WwfkbPIZc3ccp1FI9AX/m3SVqEcvsrAfw6Htko
lM22csfuJREHkTp7nBapDvOkWn4plzfOw9VhPKhq1x9DUCVFqqG/HAKv++v4
osClK6k1MmSJWaMHrW1z3n7LftV9ZZ60E0Cgro2xSaD+itRwBp07H0GeWuoK
B4+44TBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBCRv9r2lvQ1GJMoD064
EtdigCCw43EAKZWOc41yEjknjRaWDm1VUug6I90lxCsUrxoaMA==]
# Signature line
As you can see the anchors are preserved and it doesn't matter for update_aliased_scalar if you
provide the anchored "place" or one of the aliased places as a reference.
The above recurse also handles keys that are aliased, as it is perfectly fine for a key in a YAML mapping to have an anchor or to be an alias. You can even have an anchored key with a value that is an alias to the corresponding key.
It would be very nice to have support for in-place modification of existing anchored fields with types ScalarFloat/ScalarInt etc. YAML is often used for config files. One common use case I encountered is to create multiple config files from a very large template config file with only small changes made to the new files. I would load the template file into CommentedMap, modify a small set of keys in place and dump it back into a new yaml config file. This flow works very nicely if the keys to be changed are not anchored. When they are anchored, the anchors are duplicated in the new files as reported by OP and render them invalid. Manually addressing each anchored key in post-processing can be daunting when there are a large number of them.
I have a list containing paths. For example:
links=['main',
'main/path1',
'main/path1/path2',
'main/path1/path2/path3/path4',
'main/path1/path2/path3/path5',
'main/path1/path2/path3/path4/path6']
I want to create a nested dictionary to store these paths in order. Expected output:
Output = {‘main’: {‘path1’: {‘path2’: {‘path3’: {‘path4’: {‘path6’: {} }},‘path5’:{}}}}}
I am new to python coding (v 3.+) and I am unable to solve it. It gets confusing after i reach path 3 as there is path 4 (with path6 nested) and path5 as well. Can someone please help ?
Something like
tree = {}
for path in links: # for each path
node = tree # start from the very top
for level in path.split('/'): # split the path into a list
if level: # if a name is non-empty
node = node.setdefault(level, dict())
# move to the deeper level
# (or create it if unexistent)
With links defined as above, it results in
>>> tree
{'main': {'path1': {'path2': {'path3': {'path4': {'path6': {}}, 'path5': {}}}}}}
I have two lists of nested dictionaries:
lofd1 = [{'A': {'facebook':{'handle':'https://www.facebook.com/pages/New-Jersey/108325505857259','logo_id': None}, 'contact':{'emails':['nj#nj.gov','state#nj.gov']},'state': 'nj', 'population':'12345', 'capital':'Jersey','description':'garden state'}}]
lofd2 = [{'B':{'building_type':'ranch', 'city':'elizabeth', 'state':'nj', 'description':'the state close to NY'}}]
I need to:
Merge similar dictionaries in the lists, using the value of the 'state' key (for example, merge all dictionaries where "state" = "nj" into a single dictionary
It should include key/value combinations that are present in both dictionaries once (for example, "state" for both should be "nj")
It should include key/value combinations, that are not present in one of the dictionaries (for exmaple, "population", "capital" from the lofd1 and "building_type", "city" from lofd2).
Some of the values in dictionaries should be excluded, for example, 'logo_id':None
Put values in "description" from both dictionaries into a list of strings, for example '"description" : ['garden state', 'the state close to NY']'
The final dataset should look like this:
lofd_final = [{'state': 'nj', 'facebook':{'handle':'https://www.facebook.com/pages/New-Jersey/108325505857259'},'population':'12345', 'capital':'Jersey', 'contact':{'emails':['nj#nj.gov','state#nj.gov']}, 'description': ['garden state','the state close to NY'],'building_type':'ranch', 'city':'elizabeth'}]
What would be an efficient solution?
This is a solution very specific to your case. In terms of time complexity it is; O(n*m), n being the number of dicionaries in a list and m being the number of keys in a dictionary. You only ever look at each key in each dictionary once.
def extract_data(lofd, output):
for d in lofd:
for top_level_key in d: # This will be the A or B key from your example
data = d[top_level_key]
state = data['state']
if state not in output: # Create the state entry for the first time
output[state] = {}
# Now update the state entry with the data you care about
for key in data:
# Handle descriptions
if key == 'description':
if 'description' not in output[state]:
output[state]['description'] = [data['description']]
else:
output[state]['description'].append(data['description'])
# Handle all other keys
else:
# Handle facebook key (exclude logo_id)
if key == 'facebook':
del data['facebook']['logo_id']
output[state][key] = data[key]
output = {}
extract_data(lofd1, output)
extract_data(lofd2, output)
print(list(output.values()))
The output will be a dict of dicts, with the top level keys as the states. To convert it to how you specified just extract the values into a flat list: list(output.values()) (see above example).
Note: I am assuming a deep copy is not needed. So after you extract the data, I'm assuming you don't go and manipulate the values in lofd1 and lofd2. Also this is purely based on the specs that were given, e.g. if there are more nested keys that need to be excluded, you will need to add extra filters yourself.
I'm very new in python (I usually write in php). I want to understand how to store information in an associative array, and if you can explain me whats the difference of "tuples", "arrays", "dictionary" and "list" will be wonderful (I tried to read different source but I still not caching it).
So This is my code:
#!/usr/bin/python3.4
import csv
import string
nidless_keys = dict()
nidless_keys = ['test_string1','test_string2'] #this contain the string to
# be searched in linesreader
data = {'type':[],'id':[]} #here I want to store my information
with open('path/to/csv/file.csv',newline="") as csvfile:
linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
for row in linesreader: #every line in this csv have a url like
#www.test.com/?test_string1&id=123456
current_row_string = str(row)
for needle in nidless_keys:
current_needle = str(needle)
if current_needle in current_row_string:
data[current_needle[current_row_string[-8:]]) += 1 # also I
#need to count per every id how much rows there are.
In conclusion:
my_data_stored = [current_needle][current_row_string[-8]]
current_row_string[-8] is a url which the last 8 digit of the url is an ID.
So the array should looks like this at the end of the script:
test_string1 = 123456 = 20
= 256468 = 15
test_string2 = 123155 = 10
Edit 1:
Which type I need here to store the information?
Can you tell me how to resolve this script?
It seems you want to count how many times an ID in combination with a test string occurs.
There can be multiple ID/count combinations associated with every test string.
This suggests that you should use a dictionary indexed by the test strings to store the results. In that dictionary I would suggest to store collections.Counter objects.
This way, you would have to add a special case when a key in the results dictionary isn't found to add an empty Counter. This is a common problem, so there is a specialized form of dictionary in the collections module called defaultdict.
import collections
import csv
# Using a tuple for the keys so it cannot be accidentally modified
keys = ('test_string1', 'test_string2')
result = collections.defaultdict(collections.Counter)
with open('path/to/csv/file.csv',newline="") as csvfile:
linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
for row in linesreader:
for key in keys:
if key in row:
id = row[-6:] # ID's are six digits in your example.
# The first index is into the dict, the second into the Counter.
result[key][id] += 1
There is an even easier way, by using regular expressions.
Since you seem to treat every row in a CSV file as a string, there is little need to use the CSV reader, so I'll just read the whole file as text.
import re
with open('path/to/csv/file.csv') as datafile:
text = datafile.read()
pattern = r'\?(.*)&id=(\d+)'
The pattern is a regular expression. This is a large topic in and of itself, so I'll only cover briefly what it does. (You might also want to check out the relevant HOWTO) At first glance it looks like complete gibberish, but it is actually a complete language.
In looks for two things in a line. Anything between ? and &id=, and a sequence of digits after &id=.
I'll be using IPython to give an example.
(If you don't know it, check out IPython. It is great for trying things and see if they work.)
In [1]: import re
In [2]: pattern = r'\?(.*)&id=(\d+)'
In [3]: text = """www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=234567
....: www.test.com/?foo&id=234567
....: www.test.com/?foo&id=123456
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234"""
The text variable points to the string which is a mock-up for the contents of your CSV file.
I am assuming that:
every URL is on its own line
ID's are a sequence of digits.
If these assumptions are wrong, this won't work.
Using findall to extract every match of the pattern from the text.
In [4]: re.findall(pattern, test)
Out[4]:
[('test_string1', '123456'),
('test_string1', '123456'),
('test_string1', '234567'),
('foo', '234567'),
('foo', '123456'),
('foo', '1234'),
('foo', '1234'),
('foo', '1234')]
The findall function returns a list of 2-tuples (that is key, ID pairs). Now we just need to count those.
In [5]: import collections
In [6]: result = collections.defaultdict(collections.Counter)
In [7]: intermediate = re.findall(pattern, test)
Now we fill the result dict from the list of matches that is the intermediate result.
In [8]: for key, id in intermediate:
....: result[key][id] += 1
....:
In [9]: print(result)
defaultdict(<class 'collections.Counter'>, {'foo': Counter({'1234': 3, '123456': 1, '234567': 1}), 'test_string1': Counter({'123456': 2, '234567': 1})})
So the complete code would be:
import collections
import re
with open('path/to/csv/file.csv') as datafile:
text = datafile.read()
result = collections.defaultdict(collections.Counter)
pattern = r'\?(.*)&id=(\d+)'
intermediate = re.findall(pattern, test)
for key, id in intermediate:
result[key][id] += 1
This approach has two advantages.
You don't have to know the keys in advance.
ID's are not limited to six digits.
A brief summary of the python data types you mentioned:
A dictionary is an associative array, aka hashtable.
A list is a sequence of values.
An array is essentially the same as a list, but limited to basic datatypes. My impression is that they only exists for performance reasons, don't think I've ever used one. If performance is that critical to you, you probably don't want to use python in the first place.
A tuple is a fixed-length sequence of values (whereas lists and arrays can grow).
Lets take them one by one.
Lists:
List is a very naive kind of data structure similar to arrays in other languages in terms of the way we write them like:
['a','b','c']
This is a list in python , but seems very similar to array structure.
However there is a very large difference in the way lists are used in python and the usual arrays.
Lists are heterogenous in nature. This means that we can store any kind of data simultaneously inside it like:
ls = [1,2,'a','g',True]
As you can see, we have various kinds of data within a list and is a valid list.
However, one important thing about them is that we can access the list items using zero based indices. So we can write:
print ls[0],ls[3]
output: 1 g
Dictionary:
This datastructure is similar to a hash map data structure. It contains a (key,Value) pair. An empty dictionary looks like:
dc = {}
Now, to store a key,value pair, e.g., ('potato',3),(tomato,5), we can do as:
dc['potato'] = 3
dc['tomato'] = 5
and we saved the data in the dictionary dc.
The important thing is that we can even store another data structure element like a list within a dictionary like:
dc['list1'] = ls , where ls is the list defined above.
This shows the power of using dictionary.
In your case, you have difined a dictionary like this:
data = {'type':[],'id':[]}
This means that your dictionary will consist of only two keys and each key corresponds to a list, which are empty for now.
Talking a bit about your script, the expression :
current_row_string[-8:]
doesn't make a sense. The index should have been -6 instead of -8 that would give you the id part of the current row.
This part is the id and should have been stored in a variable say :
id = current_row_string[-6:]
Further action can be performed as seen the answer given by Roland.
I'm taking a programming class and have our first assignment. I understand how it's supposed to work, but apparently I haven't hit upon the correct terms to search to get help (and the book is less than useless).
The assignment is to take a provided data set (names and numbers) and perform some manipulation and computation with it.
I'm able to get the names into a list, and know the general format of what commands I'm giving, but the specifics are evading me. I know that you refer to the numbers as names[0][1], names[1][1], etc, but not how to refer to just that record that is being changed. For example, we have to have the program check if a name begins with a letter that is Q or later; if it does, we double the number associated with that name.
This is what I have so far, with ??? indicating where I know something goes, but not sure what it's called to search for it.
It's homework, so I'm not really looking for answers, but guidance to figure out the right terms to search for my answers. I already found some stuff on the site (like the statistics functions), but just can't find everything the book doesn't even mention.
names = [("Jack",456),("Kayden",355),("Randy",765),("Lisa",635),("Devin",358),("LaWanda",452),("William",308),("Patrcia",256)]
length = len(names)
count = 0
while True
count < length:
if ??? > "Q" # checks if first letter of name is greater than Q
??? # doubles number associated with name
count += 1
print(names) # self-check
numberNames = names # creates new list
import statistics
mean = statistics.mean(???)
median = statistics.median(???)
print("Mean value: {0:.2f}".format(mean))
alphaNames = sorted(numberNames) # sorts names list by name and creates new list
print(alphaNames)
first of all you need to iter over your names list. To do so use for loop:
for person in names:
print(person)
But names are a list of tuples so you will need to get the person name by accessing the first item of the tuple. You do this just like you do with lists
name = person[0]
score = person[1]
Finally to get the ASCII code of a character, you use ord() function. That is going to be helpful to know if name starts with a Q or above.
print(ord('A'))
print(ord('Q'))
print(ord('R'))
This should be enough informations to get you started with.
I see a few parts to your question, so I'll try to separate them out in my response.
check if first letter of name is greater than Q
Hopefully this will help you with the syntax here. Like list, str also supports element access by index with the [] syntax.
$ names = [("Jack",456),("Kayden",355)]
$ names[0]
('Jack', 456)
$ names[0][0]
'Jack'
$ names[0][0][0]
'J'
$ names[0][0][0] < 'Q'
True
$ names[0][0][0] > 'Q'
False
double number associated with name
$ names[0][1]
456
$ names[0][1] * 2
912
"how to refer to just that record that is being changed"
We are trying to update the value associated with the name.
In theme with my previous code examples - that is, we want to update the value at index 1 of the tuple stored at index 0 in the list called names
However, tuples are immutable so we have to be a little tricky if we want to use the data structure you're using.
$ names = [("Jack",456), ("Kayden", 355)]
$ names[0]
('Jack', 456)
$ tpl = names[0]
$ tpl = (tpl[0], tpl[1] * 2)
$ tpl
('Jack', 912)
$ names[0] = tpl
$ names
[('Jack', 912), ('Kayden', 355)]
Do this for all tuples in the list
We need to do this for the whole list, it looks like you were onto that with your while loop. Your counter variable for indexing the list is named count so just use that to index a specific tuple, like: names[count][0] for the countth name or names[count][1] for the countth number.
using statistics for calculating mean and median
I recommend looking at the documentation for a module when you want to know how to use it. Here is an example for mean:
mean(data)
Return the sample arithmetic mean of data.
$ mean([1, 2, 3, 4, 4])
2.8
Hopefully these examples help you with the syntax for continuing your assignment, although this could turn into a long discussion.
The title of your post is "Need help working with lists within lists" ... well, your code example uses a list of tuples
$ names = [("Jack",456),("Kayden",355)]
$ type(names)
<class 'list'>
$ type(names[0])
<class 'tuple'>
$ names = [["Jack",456], ["Kayden", 355]]
$ type(names)
<class 'list'>
$ type(names[0])
<class 'list'>
notice the difference in the [] and ()
If you are free to structure the data however you like, then I would recommend using a dict (read: dictionary).
I know that you refer to the numbers as names[0][1], names[1][1], etc, but
not how to refer to just that record that is being changed. For
example, we have to have the program check if a name begins with a
letter that is Q or later; if it does, we double the number associated
with that name.
It's not entirely clear what else you have to do in this assignment, but regarding your concerns above, to reference the ith"record that is being changed" in your names list, simply use names[i]. So, if you want to access the first record in names, simply use names[0], since indexing in Python begins at zero.
Since each element in your list is a tuple (which can also be indexed), using constructs like names[0][0] and names[0][1] are ways to index the values within the tuple, as you pointed out.
I'm unsure why you're using while True if you're trying to iterate through each name and check whether it begins with "Q". It seems like a for loop would be better, unless your class hasn't gotten there yet.
As for checking whether the first letter is 'Q', str (string) objects are indexed similarly to lists and tuples. To access the first letter in a string, for example, see the following:
>>> my_string = 'Hello'
>>> my_string[0]
'H'
If you give more information, we can help guide you with the statistics piece, as well. But I would first suggest you get some background around mean and median (if you're unfamiliar).