Stata behaviour on macros, different outputs - statistics

I have a manual list I created in a macro in stata, something like
global list1 "a b c d"
which I later iterate through with something like
foreach name in $list1 {
    action
}
I am trying to change this to a DB-driven list because the list is getting big and changing quickly. I create a new $list1 with the following commands:
odbc load listitems=items, exec("SELECT items from my_table")
levelsof listitems
global list1=r(levels)
The items in each are the same, but this list seems to be different, and when I have too many items it breaks on the foreach loop with the error
{ required
r(100);
Also, when I run only levelsof listitems I get the output
`"a"' `"b"' `"c"' `"d"'
Which looks a little bit different than the other macros.
I've been stuck on this for a while. Again, it only fails when the number of items becomes large (over 15). Any help would be much appreciated.

Solution 1:
levelsof listitems, clean local(list1)
foreach name of local list1 {
    ...action with `name'...
}
Solution 2:
levelsof listitems, clean
global list1 `r(levels)'
foreach name of global list1 {
    ...action with `name'...
}
Explanation:
When you type
foreach name in $list1 {
then whatever is in $list1 gets substituted inline before Stata ever sees it. If global macro list1 contains a very long list of things, then Stata will see
foreach name in a b c d e .... very long list of things here ... {
It is more efficient to tell Stata that you have a list of things in a global or local macro, and that you want to loop over those things. You don't have to expand them out on the command line. That is what
foreach name of local list1 {
and
foreach name of global list1 {
are for. You can read about other capabilities of foreach in -help foreach-.
Also, you originally coded
levelsof listitems
global list1=r(levels)
and you noted that you saw
`"a"' `"b"' `"c"' ...
as a result. Those are what Stata calls "compound quoted" strings. A compound quoted string lets you effectively nest quoted things. So, you can have something like
`"This is a string with `"another quoted string"' inside it"'
You said you don't need that, so you can use the "clean" option of levelsof to not quote up the results. (See -help levelsof- for more info on this option.) Also, you were assigning the returned result of levelsof (which is in r(levels)) to a global macro afterward. It turns out -levelsof- actually has an option named -local()- where you can specify the name of a local (not global) macro to directly put the results in. Thus, you can just type
levelsof listitems, clean local(list1)
to both omit the compound quotes and to directly put the results in a local macro named list1.
Finally, if you for some reason don't want to use that local() option and want to stick with putting your list in a global macro, you should code
global list1 `r(levels)'
rather than
global list1=r(levels)
The distinction is that the latter treats r(levels) as a function and runs it through Stata's string expression parser. In Stata, strings (strings, not macros containing strings) have a limit of 244 characters. Macros containing strings on the other hand can have thousands of characters in them. So, if r(levels) had more than 244 characters in it, then
global list1=r(levels)
would end up truncating the result stored in list1 at 244 characters.
When you instead code
global list1 `r(levels)'
then the contents of r(levels) are expanded in-line before the command is executed. So, Stata sees
global list1 a b c d e ... very long list ... x y z
and everything after the macro name (list1) is copied into that macro name, no matter how long it is.

Related

Check if string is in list with python

I'm new to python, and I'm trying to check if a String is inside a list.
I have these two variables:
new_filename: 'SOLICITUDES2_20201206.DAT' (str type)
and
downloaded_files:
[['SOLICITUDES-20201207.TXT'], ['SOLICITUDES-20201015.TXT'], ['SOLICITUDES2_20201206.DAT']] (list type)
for checking if the string is inside the list, I'm using the following:
if new_filename in downloaded_files:
    print(new_filename,'downloaded')
and I never get inside the if.
But if I do the same, but with hard-coded text, it works:
if ['SOLICITUDES2_20201206.DAT'] in downloaded_files:
    print(new_filename,'downloaded')
What am I doing wrong?
Thanks!
Your downloaded_files is a list of lists. A list can contain anything inside it: numbers, lists, dictionaries, strings, and so on. When you test whether your string is in the list, the in operator compares the string against each element of the outer list. Those elements are lists, not strings, so none of them ever equals your string.
What I suggest you do is get all the strings into a list instead of a list of lists. You can do it using list comprehension:
downloaded_files = [['SOLICITUDES-20201207.TXT'], ['SOLICITUDES-20201015.TXT'], ['SOLICITUDES2_20201206.DAT']]
downloaded_files_list = [file[0] for file in downloaded_files]
Then, your if statement should work:
new_filename = 'SOLICITUDES2_20201206.DAT'
if new_filename in downloaded_files_list:
    print(new_filename,'downloaded')
Your code is asking if a string is in a list of lists of a single string each, which is why it doesn't find any.
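If you would rather not build a flattened copy, a minimal sketch of the same check is to test each inner list in turn with any(). The variable names direct_match and nested_match are only illustrative and do not come from the question.

```python
downloaded_files = [
    ['SOLICITUDES-20201207.TXT'],
    ['SOLICITUDES-20201015.TXT'],
    ['SOLICITUDES2_20201206.DAT'],
]
new_filename = 'SOLICITUDES2_20201206.DAT'

# A string never equals a list, so testing against the outer list fails.
direct_match = new_filename in downloaded_files

# Test membership in each inner list instead; any() stops at the first hit.
nested_match = any(new_filename in inner for inner in downloaded_files)

print(direct_match, nested_match)
```

This also keeps working if an inner list ever holds more than one filename, which the file[0] approach would miss.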

Py3o.template : How to increment py3o variable inside a loop?

I'm working on an .odt report using py3o.template in Odoo. In the report, I need to add a number to each Group, so if I have 3 groups, it should print out: Group 1, Group 2, Group 3
I did it like in this image :
It doesn't work and I get these errors :
"Could not move siblings for '%s'" % py3o_base
py3o.template.main.TemplateException: Could not move siblings for
'with="i=1"'
Can you help me?
PS : I need to call "i" between Group and [date_start]
This error happens when you have not closed your tags properly.
I can see only one closing with; try closing the other two.
Check the following example:
with="index=1"
with="index=index+1"
with="index=index+1"
function="index" index=3
/with
function="index" index=2
/with
function="index" index=1
/with
function="index" index undefined
Note that if a variable of the same name already existed outside of the scope of the with directive, it will not be overwritten. Instead, it will have the same value it had prior to the with assignment. Effectively, this means that variables are immutable.
To avoid that we can use a list to hold indexes and in each iteration we will create a new variable to hold the new index (we can use list.pop(0) to remove the previous index).
with="index=[1]"
for="item in range(4)"
function="index[-1]"
with="_=index.append(index[-1]+1)"
/with
with="_=index.pop(0)"
/with
/for
At the end index = function="str(index)"
/with
The output should be:
1
2
3
4
At the end index == [5]
To enumerate the loop items I highly recommend using the enumerate function.
for="index,d in enumerate(o.get_session_date_ids(), 1)"
function="index"
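For reference, the expression in the for directive above is written as ordinary Python, and this is roughly how enumerate behaves with a start value of 1. This is a plain-Python sketch only: session_dates is a hypothetical stand-in for whatever o.get_session_date_ids() returns, and the label format is an assumption.

```python
# Hypothetical stand-in for the records returned by o.get_session_date_ids().
session_dates = ['2021-01-04', '2021-01-11', '2021-01-18']

# enumerate(..., 1) yields (1, first), (2, second), ... so no counter
# variable has to be incremented by hand inside the loop.
labels = ['Group %d %s' % (index, d) for index, d in enumerate(session_dates, 1)]

for label in labels:
    print(label)
```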

PowerShell on CSV file - looking for string depending on string

I need your help regarding PowerShell programming on CSV file.
I've made some searches but cannot find what I'm looking for (or perhaps I don't know the technical terms). Basically, I have an Excel workbook with large amount of data (more or less 38 columns x 350.000 rows), and there are a couple of formulas that take hours to calculate.
I was first wondering if PowerShell could speed up a bit the calculation compared to Excel. The calculations taking most of my time are in fact not that complex (at least at first glance). My data is more or less constructed like this:
Ref Title
----- --------------------------
A/001 "free_text"
A/002 "free_text A/001 free_text"
... ...
A/005 "free_text A/004 free_text"
A/006 "free_text"
B/001 "free_text"
B/002 "free_text"
C/001 "free_text"
C/002 "free_text"
...
C/050 "free_text C/047 free_text"
... ...
C/103 "free_text"
D/001 "free_text"
D/002 "free_text D/001 free_text"
... ....
Basically the data is as follows:
the Ref field contains unique values, in {letter}/{incremental value} format.
In some rows, the Title field may call up one of the Ref data. For example, in line 2, the Title calls for the A/001 Ref. In the last row, the Title calls for the D/001 Ref, etc.
There is no logic pattern defining when this ref could be called up in a title. This is random.
However, what I'm 100% sure of is the following:
The Ref called in the Title is always belonging to the same {letter} block. For example: the string 'C/047' in the Title field can only be found in the block where the Ref {letter} is C.
A row whose Title calls a Ref is always located 'after' (in a lower row than) the row of the Ref it refers to. In other words, I cannot have a line with the following pattern:
Ref Title
------------ -----------------------------------------
{letter/i} {free_text {letter/j} free_text} with j>i
→ This is not possible.
→ j is always < i
I've used these characteristics in Excel to minimize my lookup arrays. But it still takes an hour to calculate everything.
I've therefore looked into PowerShell, and started to 'play' a bit with the CSV, and looping with the ForEach-Object hoping I would have quicker results. Up to now I basically ended-up looping twice on my CSV file.
$CSV1 = Import-Csv myfile.csv
$CSV2 = Import-Csv myfile.csv
$CSV1 | ForEach-Object {
    # find Title
    $TitSearch = $_.$Ref
    $CSV2 | ForEach-Object {
        if ($_.$Title -eq $TitSearch) {
            # myinstructions
        }
    }
}
It works but it's really really really long. So I then tried the following instead of using the $CSV2 | ForEach...:
$CSV | where {$_.$Title -eq $TitleSearch} | % $Ref
In either case, it's too long and not efficient at all. Additionally with these 2 solutions, I'm not using above characteristics which could reduce the lookup array and as already stated, it seems I end up looping twice on the CSV file from its beginning up to the end.
Questions:
Is there a leaner way to do this?
Am I wasting my time with PowerShell?
I thought about creating 1 file per Ref {letter} block (1 file for block A, 1 for B, etc.). However, I have about 50.000 blocks to create. Or I could create them one by one, carry out the analysis, put the results in a new file, and delete them. Would that be quicker?
Note: this is for work, to be used by other colleagues, and Excel and PowerShell are really the only software we may use. I know VBA, but OK... In the end I'm curious about how and if this can be solved in a simple manner using PowerShell.
As far as I can see, your base algorithm does N^2 iterations (~120 billion). There is a standard way to make it efficient: build a hashtable first. A hashtable is a key/value store, and lookup is pretty much instantaneous, so the algorithm's time complexity becomes ~N.
PowerShell has a built-in data type for that. In your case the key would be ref, and the value an array of cell data (assuming your table is something like: ref, title, col1, ..., colN)
$hash = @{}
foreach ($row in $table) { $hash.Add($row.ref, @($row.title, $row.col1, ...)) }
# it will take 350K steps to generate it
# then you can iterate over it again
foreach ($key in $hash.Keys) {
    $key # access current ref
    $rowData = $hash[$key] # access to current row elements (by index)
    $refRowData = $hash[$rowData[$j]] # lookup from other rows, assuming lookup reference is in some column
}
So that's the general idea for solving the time issue. To be honest, I don't believe you need to reinvent the wheel and code it yourself. What you need is a relational database. Since you have Excel, you should have MS Access too. Just import your data there, make ref and title an index, and then all you need to do is a self join. MS Access is not great, but I'm sure it will handle 350K rows just fine.
Ideally you'd get a database on some corporate MSSQL server (open a ticket, talk to your manager, etc.). It will calculate all that in seconds, and then you can link the output to a spreadsheet as well.
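The hashtable idea is language independent, so here is a minimal sketch of it in Python rather than PowerShell. The sample rows, the names title_by_ref and links, and the regular expression for spotting a Ref inside a Title are all assumptions made for illustration; the real Ref format may need a different pattern.

```python
import re

# Hypothetical sample rows in (Ref, Title) form.
rows = [
    ('A/001', 'free_text'),
    ('A/002', 'free_text A/001 free_text'),
    ('B/001', 'free_text'),
    ('C/001', 'free_text'),
    ('C/002', 'free_text C/001 free_text'),
]

# One pass over the data builds the key/value lookup (Ref -> Title).
title_by_ref = {ref: title for ref, title in rows}

# Assumed Ref shape: capital letters, a slash, then digits.
ref_pattern = re.compile(r'[A-Z]+/\d+')

# Second pass: each candidate found in a Title costs one dictionary lookup,
# so the total work grows with the number of rows, not its square.
links = {}
for ref, title in rows:
    for mentioned in ref_pattern.findall(title):
        if mentioned in title_by_ref:
            links[ref] = mentioned

print(links)
```

The point of the design is that the inner loop over the whole file disappears: the only per-row work is scanning that row's own Title.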

Dict key getting overwritten when created in a loop

I'm trying to create individual dictionary entries while looping through some input data. Part of the data is used for the key, while a different part is used as the value associated with that key. I'm running into a problem (due to Python's "everything is an object, and you reference that object" operations method) with this, as every iteration through my loop alters the key set in previous iterations, thus overwriting the previously set value, instead of creating a new dict key and setting it with its own value.
popcount = {}
for oneline in datafile:
    if oneline[:3] == "POP":
        dat1, dat2, dat3, dat4, dat5, dat6 = oneline.split(":")
        datid = str.join(":", [dat2, dat3])
        if datid in popcount:
            popcount[datid] += int(dat4)
        else:
            popcount = { datid : int(dat4) }
This iterates over seven lines of data (datafile is a list containing that information) and should create four separate keys for datid, each with its own value. However, what ends up happening is that only the last value for datid exists in the dictionary when the code is run. That happens to be the one that has duplicates, and they get summed properly (so at least I know that part of the code works), but the other key entries are just ... gone.
The data is read from a file, is colon (:) separated, and is treated as a string even when it's numeric (thus the int() call in the if datid in popcount block).
What am I missing/doing wrong here? So far I haven't been able to find anything that helps me out on this one (though you folks have answered a lot of other Python questions I've run into, even if you didn't know it). I know why it's failing; or, I think I do -- it is because when I update the value of datid the key gets pointed to the new datid value object even though I don't want it to, correct? I just don't know how to fix or work around this behavior. To be honest, it's the one thing I dislike about working in Python (hopefully once I grok it, I'll like it better; until then...).
Simply change your last line:
popcount = { datid : int(dat4) } # This does not do what you want
This creates a new dict and assigns it to popcount, throwing away your previous data.
What you want to do is add an entry to your existing dict instead:
popcount[datid] = int(dat4)
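As a side note, the if/else can be collapsed with dict.get, which returns a default when the key is missing. The sketch below uses made-up sample lines in the colon-separated shape the question describes; the actual field contents are assumptions.

```python
# Hypothetical input lines with six colon-separated fields each.
datafile = [
    "POP:north:a:10:x:y",
    "POP:north:b:5:x:y",
    "HDR:ignored:line:0:x:y",
    "POP:north:a:7:x:y",
]

popcount = {}
for oneline in datafile:
    if oneline[:3] == "POP":
        dat1, dat2, dat3, dat4, dat5, dat6 = oneline.split(":")
        datid = ":".join([dat2, dat3])
        # get() supplies 0 for a key seen for the first time, so the same
        # line handles both creating and updating an entry.
        popcount[datid] = popcount.get(datid, 0) + int(dat4)

print(popcount)
```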

Removing comma coming after a list-item word

This is my Python code (I am using Python 2.7.6):
>>> cn = ['here,', 'there']
>>> for c in cn:
...     if c.endswith(','):
...         c = c[:-1]
...
>>> print(cn)
['here,', 'there']
As you can see, cn[0] still has the trailing comma even though I did c = c[:-1]. How do I change cn[0] so that it equals 'here' without the trailing comma?
The problem is that you assign a new value to c but do not update the list.
In order to manipulate the list in place you would need to do something like this:
cn = ['here,', 'there']
for index, c in enumerate(cn):
    if c.endswith(','):
        cn[index] = c[:-1]
print(cn)
['here', 'there']
Enumerate gives you all elements in the list, along with their index, and then, if the string has a trailing comma, you just update the list element at the right index.
The problem with your code was that c was simply holding the string 'here,'. Then you created a new string without the comma and assigned it to c; this did not affect the list cn. In order to have any effect on the list you need to set the new value at the desired position.
You could also use a list comprehension to achieve the same result, which might be more pythonic for such a small task (as mentioned by @AdamSmith):
cn = [c[:-1] if c.endswith(',') else c for c in cn]
This creates a new list from cn where each element is returned unchanged if it does not end with a comma, and otherwise the comma is cut off before returning the string.
Another thing you could use would be the built-in rstrip method, but it would remove all trailing commas, not just one. It would look something like this (again @AdamSmith pointed this out):
cn = map(lambda x: x.rstrip(','), cn)
What you did was copy the string with an edit, and store it in a new variable named c, with a name that clashed with the original c and overrode it.
Since you didn't change the original, it stays unchanged when the new one goes out of scope and gets cleared up.
You could do something like this to make a new list, containing the changed strings:
newlist = [c.strip(',') for c in cn]
Or, more closely to your example while keeping this approach:
cn = [c[:-1] if c.endswith(',') else c for c in cn]
The alternate loop-with-enumerate approach in another answer will work, but I'm avoiding it because in general, changing lists while iterating over them can lead to mistakes unless done carefully, and that makes me think it's a bad habit to get into in cases where there's another reasonably neat approach.
You're changing the loop variable c, but you aren't changing the list cn. To change the list cn, instead of looping through its values, loop through its indexes. Here's an implementation.
for c in range(len(cn)):
    if cn[c].endswith(','):
        cn[c] = cn[c][:-1]
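To make the difference between the approaches above concrete, here is a small sketch with a made-up third item that has two trailing commas. Note also that the question uses Python 2.7, where map returns a list; in Python 3 it returns a lazy iterator, so it needs wrapping in list() to get a list back.

```python
cn = ['here,', 'there', 'everywhere,,']  # third item is hypothetical test data

# Slicing removes exactly one trailing comma.
one_comma = [c[:-1] if c.endswith(',') else c for c in cn]

# rstrip(',') removes every trailing comma.
all_trailing = [c.rstrip(',') for c in cn]

# In Python 3, map() is lazy, so materialise it explicitly.
mapped = list(map(lambda x: x.rstrip(','), cn))

print(one_comma)
print(all_trailing)
print(mapped)
```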
