Comparing two jumbled / unaligned text files

I have two text files containing IL code, for example; both text files include 100% the same code:
http://pastebin.com/iDvbu1tD
http://pastebin.com/u5fi9NMh
They are, however, unaligned/jumbled, so it's not possible for me to find any differences. I am looking for a way to take such files, filter out the code that is the same, and show only the blocks that are different.
Is there a way to do such a task?

First the "TL;DR": the two pastes are in fact identical, as you say, except for a simple difference in the order of the methods, and a very minor formatting problem. Namely, the .method line of the middle method in each disassembly listing isn't indented by two spaces like the other two .method lines in the same file.
I copied the files to a Ubuntu Linux system, and took the liberty of fixing the minor .method formatting problem by hand in both of them.
Then I wrote some extraction code in the TXR language whose purpose is to normalize each file: it extracts the methods, identifies the name of each one, and then prints a reformatted version of the disassembly. These reformatted versions can then be compared.
In the reformatting, the methods are sorted by name. Also, the IL_XXXX offsets on the instructions are deleted, so that minor differences, like the insertion of an instruction sequence, will not result in huge differences due to the subsequent offsets changing. However, this is not demonstrated with the given data, since the code is in fact identical.
Here is what the reformatting of one of the inputs looks like, abbreviated manually:
$ txr norm.txr disasm-a
.method public hidebysig newslot virtual final instance bool IsRoomConnected() cil managed
{
// Code size 21 (0x15)
.maxstack 8
ldarg.0
ldfld class [UnityEngine]UnityEngine.AndroidJavaObject GooglePlayGames.Android.AndroidRtmpClient::mRoom
brfalse IL_0013
ldarg.0
[ ... snip ... ]
ldc.i4.0
ret
} // end of method AndroidRtmpClient::IsRoomConnected
[ ... other methods ... ]
The .method material is folded into one line; IsRoomConnected comes first because the other two functions start with P and O, respectively; and the offsets are gone from the instructions.
Comparing the reformatted versions can be done in the Linux environment in one step, without temporary files, thanks to Bash's "process substitution" syntax: <(command ...). This syntax lets us use, in the place of a file name argument, a program which produces output; the command reading that argument treats it as a file:
$ diff -u <(txr norm.txr disasm-a) <(txr norm.txr disasm-b)
# no output!
That is to say, when we diff the normalized versions of the two disassembly listings, there is no output: they are character-for-character identical!
A code listing of norm.txr follows:
@(collect)
@  (freeform)
.method @decl {
@  (bind pieces @(tok-str decl #/\S+/))
@  (bind name @(find-if (op search-str @1 "(") pieces))
@  (collect)
@    (cases)
IL_@offset: @line
@    (or)
@line
@    (end)
@  (last)
} // @eom
@  (end)
@(end)
@(set (name decl line eom) @(multi-sort (list name decl line eom)
                                        [list less]))
@(output)
@  (repeat)
.method @decl
{
@    (repeat)
@line
@    (end)
} // @eom
@  (end)
@(end)
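If TXR isn't at hand, the same normalization (extract the .method blocks, strip the instruction offsets, sort) can be sketched in Python. This is only a rough equivalent: it assumes the ildasm layout shown above, and it sorts blocks by their full .method header text rather than by the extracted method name. Call it norm.py (a hypothetical name):

import re
import sys

def normalize(path):
    with open(path) as f:
        text = f.read()
    # Each block runs from ".method" through its "end of method" comment;
    # DOTALL lets ".*?" span multiple lines.
    blocks = re.findall(r"\.method.*?\} // end of method [^\n]*",
                        text, flags=re.DOTALL)
    # Drop leading IL_xxxx: offsets so an inserted instruction doesn't shift
    # every later line; branch targets such as "brfalse IL_0013" are kept.
    blocks = [re.sub(r"^\s*IL_[0-9a-fA-F]+:\s*", "", b, flags=re.MULTILINE)
              for b in blocks]
    return sorted(blocks)

for block in normalize(sys.argv[1]):
    print(block)
    print()

The comparison step is then the same: diff -u <(python norm.py disasm-a) <(python norm.py disasm-b).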

Related

How to get a pandoc Lua filter to avoid counting words in code blocks with this pattern inside an R Markdown file?

This is a follow-up question to this post. What I want to achieve is to avoid counting words in headers and inside code blocks having this pattern:
```{r label-name}
all code words not to be counted.
```
Rather than this pattern:
```
{r label-name}
all code words not to be counted.
```
This is because when I use the latter pattern I lose font-lock fontification in the R Markdown buffer in Emacs, so I always use the first one.
Consider this MWE:
MWE (MWE-wordcount.Rmd)
# Results {-}
## Topic 1 {-}
This is just a random text with a citation in markdown (\@ref(fig:pca-scree)).
Below is a code block.
```{r pca-scree, echo = FALSE, fig.align = "left", out.width = "80%", fig.cap = "Scree plot with parallel analysis using simulated data of 100 iterations (red line) suggests retaining only the first 2 components. Observed dimensions with their eigenvalues are shown in green."}
knitr::include_graphics("./plots/PCA_scree_parallel_analysis.png")
```
## Topic 2 {-}
<!-- todo: a comment that needs to be avoided by word count hopefully-->
The result should be only 17 words, not counting words in code blocks, comments, or Markdown markup (like the headers).
I followed the method explained here to get pandoc to count the words using a Lua filter. In short, I did these steps:
From the command line:
mkdir -p ~/.local/share/pandoc/filters
Then I created a file there named wordcount.lua with this content:
-- counts words in a document
words = 0

wordcount = {
  Str = function(el)
    -- we don't count a word if it's entirely punctuation:
    if el.text:match("%P") then
      words = words + 1
    end
  end,

  Code = function(el)
    _, n = el.text:gsub("%S+", "")
    words = words + n
  end,
}

function Pandoc(el)
  -- skip metadata, just count body:
  pandoc.walk_block(pandoc.Div(el.blocks), wordcount)
  print(words .. " words in body")
  os.exit(0)
end
I put the following Elisp code in the scratch buffer and evaluated it:
(defun pandoc-count-words ()
  (interactive)
  (shell-command-on-region (point-min) (point-max)
                           "pandoc --lua-filter wordcount.lua"))
From inside the MWE Markdown file (MWE-wordcount.Rmd) I issued M-x pandoc-count-words and got the count in the minibuffer.
Using the first pattern I get 62 words.
Using the second pattern I get 22 words, which is more reasonable.
This method successfully avoids counting words inside a comment.
Questions
How can I get the Lua filter to avoid counting words when using the first pattern rather than the second?
How can I get the Lua filter to avoid counting words in the headers (##)?
I would also appreciate it if the answer explained how the Lua code works.
This is a fun question; it combines quite a few technologies. The most important here is R Markdown, and we need to look under the hood to understand what's going on.
One of the first steps in R Markdown processing is to parse the document, find all R code blocks (marked by the {r ...} pattern), execute those blocks, and replace each block with its evaluation results. The modified input text is then passed to pandoc, which parses it into an abstract syntax tree (AST). That AST can be examined or modified with a filter before pandoc writes the document in the target format.
This is relevant because it is R Markdown, not pandoc, that recognizes input of the form
``` {r ...}
# code
```
as code blocks, while pandoc parses them as inline code that is identical to ` {r ...} # code `, i.e., all newlines in the code are ignored. The reason for this lies in pandoc's attribute parsing and the overloading of ` chars in Markdown syntax.¹
This gives us the answer to your first question: we can't! The two code snippets look exactly the same by the time they reach the filter in pandoc's AST; they cannot be distinguished. However, we get proper code blocks with newlines if we run R Markdown's knitr step to execute the code.
So one solution could be to make the wordcount.lua filter part of the R Markdown processing step, but to run the filter only when the COUNT_WORDS environment variable is set. We can do that by adding this snippet to the top of the filter file:
if not os.getenv 'COUNT_WORDS' then
  return {}
end
See the R Markdown cookbook on how to integrate the filter.
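For reference, one way to wire this up (a sketch based on the cookbook's chapter on Lua filters; the html_document output format here is just an example) is to pass the filter through pandoc_args in the document's YAML header:

output:
  html_document:
    pandoc_args: ["--lua-filter=wordcount.lua"]

Then render with the variable set, e.g. COUNT_WORDS=1 Rscript -e 'rmarkdown::render("MWE-wordcount.Rmd")'; without COUNT_WORDS set, the filter returns an empty table and rendering proceeds normally.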
I'm leaving out the second question, because this answer is already quite long and that subquestion is worth a separate post.
¹: pandoc would recognize this as a code block if the r was preceded by a dot, as in
``` {.r}
# code
```

loop to read multiple files

I am using ObsPy's _read_segy function to read a SEGY file, using the following line of code:
line_1 = _read_segy('st1.segy')
However, I have a large number of files in a folder, as follows:
st1.segy
st2.segy
st3.segy
.
.
st700.segy
I want to use a for loop to read the data, but I am new, so can anyone help me in this regard?
Currently I am using repeated lines to read the data, as follows:
line_1 = _read_segy('st1.segy')
line_2 = _read_segy('st2.segy')
The next step is to display the SEGY data using matplotlib, and again I am using the following lines of code on individual lines, which makes for far too much repeated work. Can someone help me with creating a loop to display the data and save the figures?
data = np.stack([t.data for t in line_1.traces])
vm = np.percentile(data, 99)
plt.figure(figsize=(60, 30))
plt.imshow(data.T, cmap='seismic', vmin=-vm, vmax=vm, aspect='auto')
plt.title('Line_1')
plt.savefig('Line_1.png')
plt.show()
Your kind suggestions will help me a lot as I am a beginner in python programming.
Thank you
If you want to reduce code duplication, you use functions, and if you want to do something repeatedly, you use loops. So you can call a function in a loop if you want to do this for all files.
Now, for reading the files in a folder, you can use Python's glob module. Something like below:
import glob

import numpy as np
import matplotlib.pyplot as plt
from obspy.io.segy.segy import _read_segy  # adjust if your _read_segy comes from elsewhere

def save_fig(in_file_name, out_file_name):
    line = _read_segy(in_file_name)
    data = np.stack([t.data for t in line.traces])
    vm = np.percentile(data, 99)
    plt.figure(figsize=(60, 30))
    plt.imshow(data.T, cmap='seismic', vmin=-vm, vmax=vm, aspect='auto')
    plt.title(out_file_name)
    plt.savefig(out_file_name)
    plt.close()  # free the figure instead of keeping hundreds open

segy_files = sorted(glob.glob(segy_files_path + "/*.segy"))
for index, file in enumerate(segy_files):
    save_fig(file, "Line_{}.format".format(index + 1) if False else "Line_{}.png".format(index + 1))

Here segy_files_path is the folder where your files reside. Note that sorted() makes the order deterministic, but it is lexicographic, so "Line_10.png" comes from the file sorted tenth alphabetically, not necessarily from st10.segy.
You just need to dynamically open the files in a loop. Fortunately they all follow the same naming pattern.
import numpy as np
import matplotlib.pyplot as plt
from obspy.io.segy.segy import _read_segy  # adjust if your _read_segy comes from elsewhere

N = 700
for n in range(1, N + 1):  # files are numbered st1 ... st700
    line_n = _read_segy(f"st{n}.segy")  # Dynamic name.
    data = np.stack([t.data for t in line_n.traces])
    vm = np.percentile(data, 99)
    plt.figure(figsize=(60, 30))
    plt.imshow(data.T, cmap="seismic", vmin=-vm, vmax=vm, aspect="auto")
    plt.title(f"Line_{n}")
    plt.savefig(f"Line_{n}.png")  # Save before show(); show() discards the figure.
    plt.show()
    plt.close()  # Needed if you don't want to keep 700 figures open.
I'll focus on addressing the file looping, as you said you're new and I'm assuming simple loops are something you'd like to learn about (the first example is sufficient for this).
If you'd like an answer to your second question, it might be worth providing some example data, the output result (graph) of your current attempt, and a description of your desired output. If you provide that reproducible example and clear description of the problem you're having it'd be easier to answer.
Create a list (or other iterable) to hold the file names to read, and another container (maybe a dict) to hold the result of your read_segy.
files = ['st1.segy', 'st2.segy']
lines = {}  # creates an empty dictionary; dictionaries consist of key: value pairs

for f in files:  # f will first be 'st1.segy', then 'st2.segy'
    lines[f] = _read_segy(f)
As stated in the comment by @Guimoute, if you want to dynamically generate the file names, you can create the files list by appending integers to the base file name.
lines = {}  # creates an empty dictionary; dictionaries have key: value pairs
missing_files = []

for i in range(1, 701):
    f = f"st{i}.segy"  # gives "st1.segy" for i = 1
    try:  # in case one of the files is missing or can't be read
        lines[f] = _read_segy(f)
    except Exception:
        missing_files.append(f)  # store names of missing or unreadable files
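If you then also want the figures, here is a sketch that feeds the lines dict built above into the plotting recipe from the question (assuming numpy and matplotlib are imported as np and plt):

for name, line in lines.items():
    data = np.stack([t.data for t in line.traces])
    vm = np.percentile(data, 99)
    plt.figure(figsize=(60, 30))
    plt.imshow(data.T, cmap="seismic", vmin=-vm, vmax=vm, aspect="auto")
    plt.title(name)
    plt.savefig(f"{name}.png")  # e.g. "st1.segy.png"
    plt.close()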

How to change an anchored scalar in a sequence without destroying the anchor in ruamel.yaml?

When using ruamel.yaml version 0.15.92 with Python 3.6.6 on CentOS 7, I cannot seem to update the value of an anchored scalar in a sequence without destroying the anchor itself or creating invalid YAML from the next dump.
I have attempted to recreate the original node type with the new value (old PlainScalarString -> new PlainScalarString, old FoldedScalarString -> new FoldedScalarString, etc), copying the anchor to it. While this restores the anchor to the updated scalar value, it also creates invalid YAML because the first alias later in the YAML file duplicates the same anchor name and assigns to it the old value of the scalar I'm trying to update.
I then attempted to replace all of the affected aliases with actual alias text -- like *anchor_name -- but that causes the value to become quoted like '*anchor_name', rendering the alias useless.
I reverted that and then attempted to suppress the duplicate anchor name (by setting always_dump=False on every affected alias). While that does suppress the duplicate anchor name, it unfortunately just dumps the old value of the anchored scalar.
My entire test data is as follows; assume this is named test.yaml:
# Header comment
---
# Post-header comment
# Reusable aliases
aliases:
  - &plain_value This is unencrypted
  - &string_password ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEwDQYJKoZIhvcNAQEBBQAEggEAYnFbMveZGBgd9aw7h4VV+M202zRdcP96UQs1q+ViznJK2Ee08hoW9jdIqVhNaecYALUihKjVYijJa649VF7BLZXV0svLEHD8LZeduoLS3iC9uszdhDFB2Q6R/Vv/ARjHNoWc6/D0nFN9vwcrQNITnvREl0WXYpR9SmW0krUpyr90gSAxTxPNJVlEOtA0afeJiXOtQEu/b8n+UDM3eXXRO+2SEXM4ub7fNcj6V9DgT3WwKBUjqzQ5DicnB19FNQ1cBGcmCo8qRv0JtbVqZ4+WJFGc06hOTcAJPsAaWWUn80ChcTnl4ELNzpJFoxAxHgepirskuIvuWZv3h/PL8Ez3NDBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBBSuVIsvWXMmdFJtJmtJxXxgCAGFCioe/zdphGqynmj6vVDnCjA3Xc0VPOCmmCl/cTKdg==]
  - &block_password >
    ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEw
    DQYJKoZIhvcNAQEBBQAEggEAojErrxuNcdX6oR+VA/I3PyuV2CwXx166nIUp
    asEHo1/CiCIoE3qCnjK2FJF8vg+l3AqRmdb7vYrqQ+30RFfHSlB9zApSw8NW
    tnEpawX4hhKAxnTc/JKStLLu2k7iZkhkor/UA2HeVJcCzEeYAwuOQRPaolmQ
    TGHjvm2w6lhFDKFkmETD/tq4gQNcOgLmJ+Pqhogr/5FmGOpJ7VGjpeUwLteM
    er3oQozp4l2bUTJ8wk9xY6cN+eeOIcWXCPPdNetoKcVropiwrYH8QV4CZ2Ky
    u0vpiybEuBCKhr1EpfqhrtuG5s817eOb7+Wf5ctR0rPuxlTUqdnDY31zZ3Kb
    mcjqHDBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBBATq6BjaxU2bfcLL5S
    bxzsgCDsWzggzxsCw4Dp0uYLwvMKjJEpMLeFXGrLHJzTF6U2Nw==]
top_key: unencrypted value
top_alias: *plain_value
top::hash:
  ignore: more
  # This pulls its string-form value from above
  stringified_alias: *string_password
  sub:
    ignore: value
    key: unencrypted subbed-value
    # This pulls its block-form value from above
    blocked_alias: *block_password
  sub_more:
    # This is a stringified EYAML value, NOT an alias
    inline_string: ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEwDQYJKoZIhvcNAQEBBQAEggEAafmyrrae2kx8HdyPmn/RHQRcTPhqpx5Idm12hCDCIbwVM++H+c620z4EN2wlugz/GcLaiGsybaVWzAZ+3r+1+EwXn5ec4dJ5TTqo7oxThwUMa+SHliipDJwGoGii/H+y2I+3+irhDYmACL2nyJ4dv4IUXwqkv6nh1J9MwcOkGES2SKiDm/WwfkbPIZc3ccp1FI9AX/m3SVqEcvsrAfw6HtkolM22csfuJREHkTp7nBapDvOkWn4plzfOw9VhPKhq1x9DUCVFqqG/HAKv++v4osClK6k1MmSJWaMHrW1z3n7LftV9ZZ60E0Cgro2xSaD+itRwBp07H0GeWuoKB4+44TBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBCRv9r2lvQ1GJMoD064EtdigCCw43EAKZWOc41yEjknjRaWDm1VUug6I90lxCsUrxoaMA==]
    # Also NOT an alias, in block form
    block_string: >
      ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEw
      DQYJKoZIhvcNAQEBBQAEggEAafmyrrae2kx8HdyPmn/RHQRcTPhqpx5Idm12
      hCDCIbwVM++H+c620z4EN2wlugz/GcLaiGsybaVWzAZ+3r+1+EwXn5ec4dJ5
      TTqo7oxThwUMa+SHliipDJwGoGii/H+y2I+3+irhDYmACL2nyJ4dv4IUXwqk
      v6nh1J9MwcOkGES2SKiDm/WwfkbPIZc3ccp1FI9AX/m3SVqEcvsrAfw6Htko
      lM22csfuJREHkTp7nBapDvOkWn4plzfOw9VhPKhq1x9DUCVFqqG/HAKv++v4
      osClK6k1MmSJWaMHrW1z3n7LftV9ZZ60E0Cgro2xSaD+itRwBp07H0GeWuoK
      B4+44TBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBCRv9r2lvQ1GJMoD064
      EtdigCCw43EAKZWOc41yEjknjRaWDm1VUug6I90lxCsUrxoaMA==]
# Signature line
There are two forms of this issue, so here are two code examples for reproducing the conditions:
First, "How can we most simply update the value of an anchored scalar in a sequence without destroying the anchor or its aliases?" This looks like:
with open('test.yaml', 'r') as f:
    yaml_data = yaml.load(f)

yaml_data['aliases'][1] = "New string password"
yaml.dump(yaml_data, sys.stdout)
Note that this destroys the anchor. I would very much prefer the solution look as similar to this first snippet as possible; perhaps something like yaml_data['aliases'][1].set_value("New string password") # Changes only the scalar value while preserving the original anchor, comments, position, et al.
Second, "If we must instead wrap the new value in some object to preserve the anchor (and other attributes of the entry being replaced), what is the simplest approach which also preserves all aliases that refer to it (such that they adopt the updated value) when dumped?" My attempt to solve this requires quite a lot more code including recursive functions. Since SO guidelines advise against dumping large code, I will offer the relevant bits. Please assume the unlisted code is working perfectly well.
### <snip def FindEYAMLPaths(...) returns lists of paths through the YAML to every value starting with 'ENC['>
### <snip def GetYAMLValue(...) returns the node -- as a PlainScalarString, FoldedScalarString, et al. -- identified by a path from FindEYAMLPaths>
### <snip def DisableAnchorDump(...) sets `anchor.always_dump=False` if the node has an anchor attribute>
def ReplaceYAMLValue(value, data, path=None):
    if path is None:
        return
    ref = data
    last_ref = path.pop()
    for p in path:
        ref = ref[p]
    # All I'm trying to do here is change the scalar value without disrupting
    # its comments, anchor, positioning, or any of its aliases.
    # This succeeds in changing the scalar value and preserving its original
    # anchor, but disrupts its aliases, which insist on preserving the old value.
    if isinstance(ref[last_ref], PlainScalarString):
        ref[last_ref] = PlainScalarString(value, anchor=ref[last_ref].anchor.value)
    elif isinstance(ref[last_ref], FoldedScalarString):
        ref[last_ref] = FoldedScalarString(value, anchor=ref[last_ref].anchor.value)
    else:
        ref[last_ref] = value
with open('test.yaml', 'r') as f:
    yaml_data = yaml.load(f)

seen_anchors = []
for path in FindEYAMLPaths(yaml_data):
    if path is None:
        continue
    node = GetYAMLValue(yaml_data, deque(path))
    if hasattr(node, 'anchor'):
        test_anchor = node.anchor.value
        if test_anchor is not None:
            if test_anchor in seen_anchors:
                # This is expected to just be an alias, pointing at the newly updated anchor
                DisableAnchorDump(node)
                continue
            seen_anchors.append(test_anchor)
    ReplaceYAMLValue("New string password", yaml_data, path)

yaml.dump(yaml_data, sys.stdout)
Note that this produces valid YAML except that all of the affected aliases are gone, replaced instead by the old value of the anchored scalar.
I expect to be able to change the value of an aliased scalar in a sequence without disrupting any other part of the YAML content. Based on other posts I've seen about ruamel.yaml, I fully accept that I may need to dump the updated YAML to file and reload it for the in-memory aliases to update to the new value. I simply expect to change:
Input File
aliases:
  - &some_anchor Old value
usage: *some_anchor
to:
Output File
aliases:
  - &some_anchor NEW VALUE
usage: *some_anchor
Instead, here's the output from the above two examples:
First, notice that the original anchor was destroyed and the value for top::hash:stringified_alias: now carries the original anchor and old value instead of the alias to the newly updated scalar value at ['aliases'][1]:
---
# Post-header comment
# Reusable aliases
aliases:
  - &plain_value This is unencrypted
  - New string password
  - &block_password >
    ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEw
    DQYJKoZIhvcNAQEBBQAEggEAojErrxuNcdX6oR+VA/I3PyuV2CwXx166nIUp
    asEHo1/CiCIoE3qCnjK2FJF8vg+l3AqRmdb7vYrqQ+30RFfHSlB9zApSw8NW
    tnEpawX4hhKAxnTc/JKStLLu2k7iZkhkor/UA2HeVJcCzEeYAwuOQRPaolmQ
    TGHjvm2w6lhFDKFkmETD/tq4gQNcOgLmJ+Pqhogr/5FmGOpJ7VGjpeUwLteM
    er3oQozp4l2bUTJ8wk9xY6cN+eeOIcWXCPPdNetoKcVropiwrYH8QV4CZ2Ky
    u0vpiybEuBCKhr1EpfqhrtuG5s817eOb7+Wf5ctR0rPuxlTUqdnDY31zZ3Kb
    mcjqHDBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBBATq6BjaxU2bfcLL5S
    bxzsgCDsWzggzxsCw4Dp0uYLwvMKjJEpMLeFXGrLHJzTF6U2Nw==]
# ... snip ...
top::hash:
  ignore: more
  # This pulls its string-form value from above
  stringified_alias: &string_password ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEwDQYJKoZIhvcNAQEBBQAEggEAYnFbMveZGBgd9aw7h4VV+M202zRdcP96UQs1q+ViznJK2Ee08hoW9jdIqVhNaecYALUihKjVYijJa649VF7BLZXV0svLEHD8LZeduoLS3iC9uszdhDFB2Q6R/Vv/ARjHNoWc6/D0nFN9vwcrQNITnvREl0WXYpR9SmW0krUpyr90gSAxTxPNJVlEOtA0afeJiXOtQEu/b8n+UDM3eXXRO+2SEXM4ub7fNcj6V9DgT3WwKBUjqzQ5DicnB19FNQ1cBGcmCo8qRv0JtbVqZ4+WJFGc06hOTcAJPsAaWWUn80ChcTnl4ELNzpJFoxAxHgepirskuIvuWZv3h/PL8Ez3NDBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBBSuVIsvWXMmdFJtJmtJxXxgCAGFCioe/zdphGqynmj6vVDnCjA3Xc0VPOCmmCl/cTKdg==]
# ... snip ...
Second, notice that ['aliases'][1] now looks correct -- it is the new value with the original anchor -- but where I expect to see aliases to it, I instead see the old value. I expect to see *string_password instead of ENC[...].
---
# Post-header comment
# Reusable aliases
aliases:
  - &plain_value This is unencrypted
  - &string_password New string password
  - &block_password >-
    New string password
# ... snip ...
top::hash:
  ignore: more
  # This pulls its string-form value from above
  stringified_alias: ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEwDQYJKoZIhvcNAQEBBQAEggEAYnFbMveZGBgd9aw7h4VV+M202zRdcP96UQs1q+ViznJK2Ee08hoW9jdIqVhNaecYALUihKjVYijJa649VF7BLZXV0svLEHD8LZeduoLS3iC9uszdhDFB2Q6R/Vv/ARjHNoWc6/D0nFN9vwcrQNITnvREl0WXYpR9SmW0krUpyr90gSAxTxPNJVlEOtA0afeJiXOtQEu/b8n+UDM3eXXRO+2SEXM4ub7fNcj6V9DgT3WwKBUjqzQ5DicnB19FNQ1cBGcmCo8qRv0JtbVqZ4+WJFGc06hOTcAJPsAaWWUn80ChcTnl4ELNzpJFoxAxHgepirskuIvuWZv3h/PL8Ez3NDBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBBSuVIsvWXMmdFJtJmtJxXxgCAGFCioe/zdphGqynmj6vVDnCjA3Xc0VPOCmmCl/cTKdg==]
# ... snip ...
If you read in an anchored scalar, like your This is unencrypted, using ruamel.yaml, you get a PlainScalarString object (or one of the other ScalarString subclasses), which is an extremely thin layer around the basic string type. That layer has an attribute to store an anchor, if applicable (its other uses are primarily to maintain quoting/literal/folding style information). Any aliases using that anchor refer to the same ScalarString instance.
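To see this sharing concretely, here is a small check, assuming test.yaml from the question has been loaded into data with a round-trip ruamel.yaml instance (as in the code further down):

anchored = data['aliases'][1]                     # the &string_password scalar
aliased = data['top::hash']['stringified_alias']  # written as *string_password
print(type(anchored).__name__)  # PlainScalarString
print(anchored.anchor.value)    # string_password
print(anchored is aliased)      # True: anchor and alias share one instance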
When dumping, the anchor attribute is not used to create aliases; that is done in the normal way, by detecting multiple references to the same object. The attribute is only used to write the anchor id, and it is written even when the attribute is set but there are no further references (i.e. an anchor without aliases).
So it is not surprising that if you replace such an object that has multiple references (either at the anchor spot or at any of the alias spots), the reference disappears. If you then also force the same anchor name onto some other object, you get duplicate anchors; contrary to normal anchor/alias generation, no check is done on "forced" anchors.
Since a ScalarString is such a thin wrapper, these are essentially immutable objects, just like the strings themselves. Aliased dicts and lists are collection objects that can be emptied and then refilled (instead of being replaced by a new instance), but you cannot do that with a string.
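For contrast, here is a minimal sketch of the collection case just described, using a hypothetical two-key document: an aliased mapping can be emptied and refilled in place, and the alias follows along because the object's identity never changes.

import sys
import ruamel.yaml

yaml = ruamel.yaml.YAML()
data = yaml.load("""\
defaults: &defaults
  retries: 3
prod:
  settings: *defaults
""")

d = data['defaults']
d.clear()          # empty the aliased mapping in place
d['retries'] = 5   # refill it; the anchor and alias are untouched
yaml.dump(data, sys.stdout)  # both &defaults and *defaults now show retries: 5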
The implementation of ScalarString can of course be changed, so you could have your set_value() method, but that involves creating alternative classes for all of the objects (PlainScalarString, FoldedScalarString, etc.). You would have to make sure these get used for constructing and for representing, and then preferably also behave like normal strings as far as you need them to, so that you can at least print them. That is relatively easy to do, but it requires copying and slightly modifying several tens of lines of code.
I think it is easier to leave the ScalarStrings in place as they are (i.e. immutable) and, when you want to change all occurrences (i.e. references), do what that requires: update all the references to the original. If your data structure contained millions of nodes, that might be prohibitively time consuming, but it would still be a fraction of what loading and dumping the YAML itself takes:
import sys
from pathlib import Path
import ruamel.yaml

in_file = Path('test.yaml')

def update_aliased_scalar(data, obj, val):
    def recurse(d, ref, nv):
        if isinstance(d, dict):
            for i, k in [(idx, key) for idx, key in enumerate(d.keys()) if key is ref]:
                d.insert(i, nv, d.pop(k))
            for k, v in d.non_merged_items():
                if v is ref:
                    d[k] = nv
                else:
                    recurse(v, ref, nv)
        elif isinstance(d, list):
            for idx, item in enumerate(d):
                if item is ref:
                    d[idx] = nv
                else:
                    recurse(item, ref, nv)

    if hasattr(obj, 'anchor'):
        recurse(data, obj, type(obj)(val, anchor=obj.anchor.value))
    else:
        recurse(data, obj, type(obj)(val))

yaml = ruamel.yaml.YAML()
yaml.indent(mapping=2, sequence=4, offset=2)
yaml.preserve_quotes = True
data = yaml.load(in_file)

update_aliased_scalar(data, data['aliases'][1], "New string password")
update_aliased_scalar(data, data['top::hash']['sub']['blocked_alias'], "New block password\n")

yaml.dump(data, sys.stdout)
which gives:
# Post-header comment
# Reusable aliases
aliases:
  - &plain_value This is unencrypted
  - &string_password New string password
  - &block_password >
    New block password
top_key: unencrypted value
top_alias: *plain_value
top::hash:
  ignore: more
  # This pulls its string-form value from above
  stringified_alias: *string_password
  sub:
    ignore: value
    key: unencrypted subbed-value
    # This pulls its block-form value from above
    blocked_alias: *block_password
  sub_more:
    # This is a stringified EYAML value, NOT an alias
    inline_string: ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEwDQYJKoZIhvcNAQEBBQAEggEAafmyrrae2kx8HdyPmn/RHQRcTPhqpx5Idm12hCDCIbwVM++H+c620z4EN2wlugz/GcLaiGsybaVWzAZ+3r+1+EwXn5ec4dJ5TTqo7oxThwUMa+SHliipDJwGoGii/H+y2I+3+irhDYmACL2nyJ4dv4IUXwqkv6nh1J9MwcOkGES2SKiDm/WwfkbPIZc3ccp1FI9AX/m3SVqEcvsrAfw6HtkolM22csfuJREHkTp7nBapDvOkWn4plzfOw9VhPKhq1x9DUCVFqqG/HAKv++v4osClK6k1MmSJWaMHrW1z3n7LftV9ZZ60E0Cgro2xSaD+itRwBp07H0GeWuoKB4+44TBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBCRv9r2lvQ1GJMoD064EtdigCCw43EAKZWOc41yEjknjRaWDm1VUug6I90lxCsUrxoaMA==]
    # Also NOT an alias, in block form
    block_string: >
      ENC[PKCS7,MIIBiQYJKoZIhvcNAQcDoIIBejCCAXYCAQAxggEhMIIBHQIBADAFMAACAQEw
      DQYJKoZIhvcNAQEBBQAEggEAafmyrrae2kx8HdyPmn/RHQRcTPhqpx5Idm12
      hCDCIbwVM++H+c620z4EN2wlugz/GcLaiGsybaVWzAZ+3r+1+EwXn5ec4dJ5
      TTqo7oxThwUMa+SHliipDJwGoGii/H+y2I+3+irhDYmACL2nyJ4dv4IUXwqk
      v6nh1J9MwcOkGES2SKiDm/WwfkbPIZc3ccp1FI9AX/m3SVqEcvsrAfw6Htko
      lM22csfuJREHkTp7nBapDvOkWn4plzfOw9VhPKhq1x9DUCVFqqG/HAKv++v4
      osClK6k1MmSJWaMHrW1z3n7LftV9ZZ60E0Cgro2xSaD+itRwBp07H0GeWuoK
      B4+44TBMBgkqhkiG9w0BBwEwHQYJYIZIAWUDBAEqBBCRv9r2lvQ1GJMoD064
      EtdigCCw43EAKZWOc41yEjknjRaWDm1VUug6I90lxCsUrxoaMA==]
# Signature line
As you can see, the anchors are preserved, and it doesn't matter to update_aliased_scalar whether you provide the anchored "place" or one of the aliased places as the reference.
The above recurse also handles keys that are aliased, as it is perfectly fine for a key in a YAML mapping to have an anchor or to be an alias. You can even have an anchored key with a value that is an alias to the corresponding key.
It would be very nice to have support for in-place modification of existing anchored fields of types like ScalarFloat/ScalarInt, etc. YAML is often used for config files. One common use case I have encountered is creating multiple config files from a very large template config file, with only small changes made to the new files. I would load the template file into a CommentedMap, modify a small set of keys in place, and dump it back out as a new YAML config file. This flow works very nicely if the keys to be changed are not anchored. When they are anchored, the anchors are duplicated in the new files, as reported by the OP, which renders them invalid. Manually addressing each anchored key in post-processing can be daunting when there are a large number of them.

Reading a particular column value from a file using Perl

I am very new to Perl. I have a file called test.txt; the following shows the data in the file:

test    count  seed
in      5      100
checks  5      100
comb    5      100
reload  5      100
reset   5      100
There are 3 columns in the file: test, count, and seed. First, I want to read the test name "in" and store it into one variable, and after that 5 and 100 into different variables, like $test=in, $count=5 and $seed=100, because I will use these variables in one more file; that's why I'm storing them. After that, checks, comb, reload, and reset should be handled likewise: each time it should read a test row and store its fields into the declared variables.
I'm able to read the file, but I'm not able to do this. Finally, I'll print those values to see whether they are stored or not.
Finally I got the answer on my own; here is the code:
while ($line = <RD>)
{
    if ($line =~ m/^\#\w+/)
    {
        # skip comment lines
    }
    elsif ($line =~ m/^\s*\w+/)
    {
        $line1 = $&;
        $line1 =~ s/\s+//g;
        #print "$line1\n";
        push(@arr, split(" ", $line));
        $test_count = $arr[1];
        for ($i = 1; $i <= $test_count; $i = $i + 1)
        {
            print "make $arr[0] $arr[2] $arr[3] $arr[4]\n";
        }
        $#arr = -1;    # empty @arr before reading the next line
    }
}
You say:
I want to store into one variable, and after that 5 and 100 into
different variables, like $test=in, $count=5 and $seed=100
I bet that's not what you want. I bet it will be far more useful for you to store the data in a complex data structure. Something like this:
my @data;
my @cols;

# Assuming you've opened the file and stored
# the filehandle in $fh
while (<$fh>) {
    # Skip lines without any data
    next unless /\S/;

    # The first line holds the column names
    if ($. == 1) {
        @cols = split;
        next;
    }

    my %row;
    @row{@cols} = split;    # hash slice: test/count/seed => fields
    push @data, \%row;
}
As you've made no effort to attempt the problem (or, at least, you've not shown any evidence of any effort), I'm not going to spend time explaining my solution to you. You'll get some useful information from reading perldoc perldsc.

How to write Fortran Output as CSV file?

Can anyone tell me how I can write the output of my Fortran program in CSV format, so that I can open the CSV file in Excel for plotting the data?
A slightly simpler version of the write statement could be:
write (1, '(1x, F, 3(",", F))') a(1), a(2), a(3), a(4)
Note that an F edit descriptor without a width, as used here, is a nonstandard extension that only some compilers accept. Of course, this only works if your data is numeric or easily repeatable. You can leave the formatting to your spreadsheet program or be more explicit here.
I'd also recommend the csv_file module from FLIBS. Fortran is well equipped to read csv files, but not so much to write them. With the csv_file module, you put
use csv_file
at the beginning of your function/subroutine and then call it with:
call csv_write(unit, value, advance)
where unit = the file unit number, value = the array or scalar value you want to write, and advance = .true. or .false. depending on whether you want to advance to the next line or not.
Sample program:
program write_csv
use csv_file
implicit none
integer :: a(3), b(2)
open(unit=1,file='test.txt',status='unknown')
a = (/1,2,3/)
b = (/4,5/)
call csv_write(1,a,.true.)
call csv_write(1,b,.true.)
end program
output:
1,2,3
4,5
If you instead just want to use the write command, I think you have to do it like this:
write(1,'(I1,A,I1,A,I1)') a(1),',',a(2),',',a(3)
write(1,'(I1,A,I1)') b(1),',',b(2)
which is very convoluted and requires you to know the maximum number of digits your values will have.
I'd strongly suggest using the csv_file module. It's certainly saved me many hours of frustration.
The Intel and gfortran (5.5) compilers recognize:
write(unit,'(*(G0.6,:,","))')array or data structure
which doesn't have excess blanks, and the line can have more than 999 columns.
To remove excess blanks with F95, first write into a character buffer and then use your own CSV_write program to take out the excess blanks, like this:
write(Buf,'(999(G21.6,:,","))')array or data structure
call CSV_write(unit,Buf)
You can also use
write(Buf,*)array or data structure
call CSV_write(unit,Buf)
where your CSV_write program replaces whitespace with "," in Buf. This is problematic in that it doesn't separate character variables unless there are extra blanks (i.e. 'a ','abc ' is OK).
I thought a full simple example without any other library might help. I assume you are working with matrices, since you want to plot from Excel (in any case it should be easy to extend the example).
tl;dr
Print one row at a time in a loop using the format format(1x, *(g0, ", "))
Full story
The purpose of the code below is to write in CSV format (that you can easily import in Excel) a (3x4) matrix.
The important line is the one labeled 101. It sets the format.
program testcsv
  IMPLICIT NONE
  INTEGER :: i, nrow
  REAL, DIMENSION(3,4) :: matrix

  ! Create a sample matrix
  matrix = RESHAPE(source = (/1,2,3,4,5,6,7,8,9,10,11,12/), &
                   shape = (/ 3, 4 /))

  ! Store the number of rows
  nrow = SIZE(matrix, 1)

  ! Formatting for CSV
  101 format(1x, *(g0, ", "))

  ! Open connection (i.e. create file where to write)
  OPEN(unit = 10, access = "sequential", action = "write", &
       status = "replace", file = "data.csv", form = "formatted")

  ! Loop across rows
  do i = 1, nrow
    WRITE(10, 101) matrix(i,:)
  end do

  ! Close connection
  CLOSE(10)
end program testcsv
We first create the sample matrix. Then we store the number of rows in the variable nrow (this is useful when you are not sure of the matrix's dimensions beforehand). Set the format statement aside for a second. What we do next is open (create or replace) the CSV file, named data.csv. Then we loop over the rows (do statement) of the matrix and write one row at a time (write statement) to the CSV file; rows are appended one after another.
In more detail, the write statement works as WRITE(U, FMT) WHAT: we write "what" (the i-th row of the matrix: matrix(i,:)) to connection U (the one we created with the open statement), formatting WHAT according to FMT.
Note that in the example FMT=101, and 101 is the label of our format statement:
format(1x, *(g0, ", "))
What this does is: the 1x inserts a white space at the beginning of the row; the * is used for unlimited format repetition, which means that the format in the following parentheses is repeated for all the data left in the object we are printing (i.e. all elements in the matrix's row). Thus, each number in the row is formatted as g0, ", ".
g is a general edit descriptor that handles floats as well as characters, logicals and integers; the trailing 0 basically means "use the least amount of space needed to contain the object to be formatted" (it avoids unnecessary spaces). Then, after the formatted number, we ask for a comma plus a space: ", ". This produces our comma-separated values for a row of the matrix (you can use a separator other than "," if you need to). We repeat this for every row, and that's it.
(The spaces in the format are not really needed, so one could instead use format(*(g0,",")).)
Reference: Metcalf, M., Reid, J., & Cohen, M. (2018). Modern Fortran Explained: Incorporating Fortran 2018. Oxford University Press.
Ten seconds' work with a search engine finds me the FLIBS library, which includes a module called csv_file that will write strings, scalars and arrays out in CSV format.
