Issue getting the comments from YAML using ruamel.yaml

Code:
import ruamel.yaml
yaml_str = """\
# comments start
a: 52
# comment for a
b: 50
# comment for b
c: 50
# comment for c
d:
    # comment for d
   e: 60
   # comment for e
   f: 70
   # comment for f
"""
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_str)
print(data.ca.comment)
print(data.ca.items)
Output:
[None, [CommentToken('# comments start\n', line: 0, col: 0)]]
{'a': [None, None, CommentToken('\n# comment for a\n', line: 2, col: 0), None], 'b': [None, None, CommentToken('\n# comment for b\n', line: 4, col: 0), None], 'c': [None, None, CommentToken('\n# comment for c\n', line: 6, col: 0), None], 'd': [None, None, None, [CommentToken('# comment for d\n', line: 8, col: 4)]]}
Question:
Why isn't it showing the comments pertaining to the keys e and f?
What is the correct way to retrieve a comment by key? For example, how do I get the comment for the key e (# comment for e)?

In ruamel.yaml, most comments are attached to the dict-like (or list-like) structure containing the key (or element) after which the comment occurred.
To get to the comments following the keys e and f, you need to look at the dict that is the value for d:
print(data['d'].ca.items)
print('comment post comment for "e":', repr(data['d'].ca.get('e', 2).value))
which gives:
{'e': [None, None, CommentToken('\n   # comment for e\n', line: 10, col: 3), None], 'f': [None, None, CommentToken('\n   # comment for f\n', line: 12, col: 3), None]}
comment post comment for "e": '\n   # comment for e\n'
Please note that the comment for e starts with a newline, indicating there is no end-of-line comment.
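If you need to collect the comment that follows every key anywhere in the tree, a minimal recursive sketch like the following works (my own sketch, assuming all nested values are CommentedMap instances; note that the comment under d: sits at index 3 as a list, as the first output above shows, so this only picks up the index-2 tokens):
from ruamel.yaml.comments import CommentedMap

def walk_comments(d, path=()):
    # only mappings carry per-key comment entries in .ca.items
    if not isinstance(d, CommentedMap):
        return
    for key, value in d.items():
        entry = d.ca.items.get(key)
        # index 2 holds the CommentToken following the key's value
        if entry is not None and entry[2] is not None:
            print('/'.join(map(str, (*path, key))), '->', repr(entry[2].value))
        walk_comments(value, (*path, key))

walk_comments(data)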

Related

Why does merging unequally matched rows not work on local dataset?

I have a pandas dataframe with questions (type = 1) and answers (type = 2). The columns section_id and type are integers; all other columns are strings. I want to merge the "answer rows" with their corresponding "question rows" (equal values in section_id), appending some of the answer rows' values as extra columns (Ans, ans_t) to their corresponding "question rows".
c = ['pos', 'Ans', 'Q_ID', 'leg', 'que_l', 'ans_l', 'par', 'ans_f', 'que_date', 'ask', 'M_ID', 'part', 'area', 'que_t', 'ans_t', 'ISO', 'con', 'id', 'section_id', 'type', 'dep', 'off']
d = [[None, None, '16-17/1/2017-11-15/1', '16-17', '14.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'ddd', None, 'eee', 'fff', '111865.q2', 24339851, 1, None, None],
[None, None, '16-17/24/17-11-09/1', '16-17', '28.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'ppp', None, 'eee', 'fff', '111867.q1', 24339851, 1, None, None],
[None, None, '16-17/73/17-10-09/1', '16-17', '69.0', None, 'aaa', 'hhh', '2016-11-20', 'Peter Muller', '41749', 'bbb', 'ccc', 'lll', None, 'eee', 'fff', '111863.r0', 24339851, 1, None, None],
['erg', 'wer', '16-17/42/16-10-09/1', '16-17', None, 67.0, 'aaa', 'hhh', '2016-11-20', None, '46753', 'bbb', 'ccc', None, 'ttt', 'eee', 'asd', '111863.r0', 24339851, 2, None, None],
[None, None, '16-17/12/16-12-08/1', '16-17', '37.0', None, 'aaa', 'hhh', '2016-10-10', 'Peter Muller', '41749', 'bbb', 'qqq', 'rrr', None, 'eee', 'fff', '108143.r0', 24303320, 1, None, None],
['erg', 'wer', '16-17/12/16-12-07/1', '16-17', None, 64.0, 'aaa', 'hhh', '2016-10-10', None, '46753', 'bbb', 'qqq', None, 'uuu', 'eee', 'asd', '108143.r0', 24303320, 2, None, None],
[None, None, '16-17/77/16-12-04/1', '16-17', '46.0', None, 'aaa', 'hhh', '2016-10-08', 'Markus John', '34567', 'ztr', 'yyy', 'nnn', None, 'eee', 'www', '127193.q0', 10343145, 1, None, None],
['qwe', 'wer', '16-17/37/17-11-07/1', '16-17', None, 60.0, 'aaa', 'hhh', '2016-12-12', None, '19745', 'bbb', 'gtt', None, 'ooo', 'eee', 'asd', '906213.r0', 23222978, 2, None, None]]
data = pd.DataFrame(d,columns=c)
data.loc[data['type'] == 2, 'Ans.1'] = data['Ans']
data.loc[data['type'] == 2, 'ans_t.1'] = data['ans_t']
my_cols = ['que_t','ans_t','Ans','ans_t','Ans.1','ans_t.1']
data[my_cols] = data.sort_values(['section_id','type']).groupby('section_id')[my_cols].transform(lambda x: x.bfill())
data.dropna(subset=['que_t'],inplace=True)
data.reset_index(drop=True,inplace=True)
print(data)
The code works fine on the minimal reproducible example. Unfortunately the dataset is too large to account for every detail, which is why this example may not necessarily be representative.
Problem: When I run the code on the actual dataset, nothing gets merged, even though I manually checked for section_id duplicates.
Before executing the code, I remove empty cells from the dataset:
data.where(pd.notnull(data), None)
data.replace(r'^\s+$', np.nan, regex=True, inplace=True)
which doesn't solve the problem.
Question: How do I need to adjust my code in order to account for details (e.g. encoding, formats, ...) in the dataset that could cause it not to merge?
Appendix:
Someone told me to remove data from the dataset gradually, checking each time that the test case is still reproducible. If some removal results in the test case not working, reinstate it and remove something else instead. When there's absolutely nothing left that can be removed, you have your minimal data set.
Someone else said I should apply a parsing function to parse the data. It didn't help:
def parse(x):
    try:
        return int(x)
    except ValueError:
        return np.nan

data['que_t'] = data['que_t'].apply(parse)
data['ans_t'] = data['ans_t'].apply(parse)
data.dtypes
Or should I search for non-number strings and replace them with NaN?
replaced_with_nan = data['col_name'].replace(re.compile(r'\D+'), np.nan)
data['col_name'] = replaced_with_nan.astype(float)
Here is another approach which, like the answer from Andrej Kesely, returns an empty dataframe when used on the actual dataframe:
df1 = data.loc[df.type == 1].copy()
df2 = data.loc[df.type == 2].copy()
merged_df = pd.merge(df1, df2, on='section_id', how='outer')
merged_df = merged_df.loc[:,['section_id','que_t_x','ans_t_y','Ans_x','Ans_y']]
merged_df.rename(columns={'que_t_x':'que_t','ans_t_y':'ans_t','Ans_x':'Ans','Ans_y':'Ans.1'}, inplace=True)
If I've understood you correctly, you can filter the dataframe and do .merge:
x = (
    df[df["que/ans"] == 1]
    .merge(
        df[df["que/ans"] == 2],
        on="section_id",
        how="outer",
        suffixes=("_que", "_ans"),
    )
    .drop(columns=["ans_t_que", "name_que", "ans_len_que", "que_t_ans"])
)
print(x)
Prints:
que/ans_que section_id que_t_que date_que part_que que/ans_ans ans_t_ans name_ans date_ans part_ans ans_len_ans
0 1 444 qtext1 456 bbb 2.0 atext2 Markus 654.0 eee 64.0
1 1 444 qtext3 987 ddd 2.0 atext2 Markus 654.0 eee 64.0
2 1 123 qtext2 789 ccc 2.0 atext1 Alex 123.0 aaa 78.0
3 1 555 qtext4 321 fff NaN NaN NaN NaN NaN NaN
If you are reading your data from csv or excel, I would recommend defining the dtype during reading. This is to ensure that the keys you use to merge do not suffer any data loss.
Example:
section_id = 00001234
After reading from csv, it could just be 1234.
df = pd.read_csv(filename, dtype={'section_id': str})
Hope this will solve your merging issue.
Your last solution following Andrej's answer seems to be working. However, there is some missing context regarding the df variable, and all strings that represent emptiness should be replaced with np.nan. Thus I rewrote it to use data as follows:
data.replace(['', 'None', 'nan', None], np.nan, inplace=True)
df1 = data.loc[data.type == 1].copy()
df2 = data.loc[data.type == 2].copy()
merged_df = pd.merge(df1, df2, on='section_id', how='outer')
merged_df = merged_df[['section_id','que_t_x','ans_t_y','Ans_x','Ans_y']]
merged_df.rename(columns={'que_t_x':'que_t','ans_t_y':'ans_t','Ans_x':'Ans','Ans_y':'Ans.1'}, inplace=True)
print(merged_df)
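If the merge still produces nothing on the actual dataset, one plausible culprit (an assumption on my part, not verified against your data) is that section_id carries different dtypes or stray whitespace in the two subsets, so the keys never compare equal. Normalizing the key before merging rules this out:
df1['section_id'] = df1['section_id'].astype(str).str.strip()
df2['section_id'] = df2['section_id'].astype(str).str.strip()
merged_df = pd.merge(df1, df2, on='section_id', how='outer')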

Python replace None with blank with tuple data set by keeping the field name linkage

I currently have the below data set, which is a row iterator whose values are in tuples. I'm trying to formulate code that replaces None with a blank string while still being able to access the tuple values by field name. Any thoughts on how I can do this?
Current data set:
print(data)
Row(('Robert', 'Hoit', None, None, 'TX'), {'fname': 0, 'lname': 1, 'Age': 2, 'Gender': 3, 'State': 4})
Row(('James', 'Burns', 34, 'M', 'CA'), {'fname': 0, 'lname': 1, 'Age': 2, 'Gender': 3, 'State': 4})
Row(('Matt', 'Dan', 45, None, 'NY'), {'fname': 0, 'lname': 1, 'Age': 2, 'Gender': 3, 'State': 4})
The approach I took to replace None was to convert the tuple to a list, perform the replacement, and convert back to a tuple. But in doing so, the field name linkage got lost. How do I replace without dropping the field name linkage? I want to use the row.fname and row.lname calls in downstream processing.
for row in data:
    a = list(row)
    c = ['' if x is None else x for x in a]
    d = tuple(c)
    print(d.fname)
    print(d.age)
Here I'm assuming that data is a tuple that contains tuples and dictionaries. If so, then:
def rem_none(row):
    if type(row) != dict:
        row = list(row)
        for i in range(len(row)):
            if row[i] is None:
                row[i] = ''
        row = tuple(row)
        return row
    return row

for row in data:
    print(rem_none(row))
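If you also need the field-name access (row.fname) to survive the replacement, a minimal sketch with collections.namedtuple works, assuming each row carries a {field: index} mapping like the one shown in the data above (blank_nones is a hypothetical helper name):
from collections import namedtuple

def blank_nones(values, fields):
    # order the field names by their index, then rebuild as a namedtuple
    Row = namedtuple('Row', sorted(fields, key=fields.get))
    return Row(*('' if v is None else v for v in values))

row = blank_nones(('Robert', 'Hoit', None, None, 'TX'),
                  {'fname': 0, 'lname': 1, 'Age': 2, 'Gender': 3, 'State': 4})
print(row.fname, row.Age)  # -> Robert (and a blank string for Age)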

Ffill and interpolate koalas dataframe

Is it possible to interpolate and ffill different columns in a Koalas dataframe, something like this?
%%spark -s sparkenv2
import databricks.koalas as ks

kdf = ks.DataFrame({
    'id': [1, 2, 3, 4],
    'A': [None, 3, None, None],
    'B': [2, 4, None, 3],
    'C': [99, None, None, 1],
    'D': [0, 1, 5, 4]
    },
    columns=['id', 'A', 'B', 'C', 'D'])
kdf['A'] = kdf['A'].ffill()
kdf['B'] = kdf['B'].interpolate()
For ffill, this is taken from John Paton's blog:
import sys

from pyspark.sql import Window
from pyspark.sql.functions import last

spark_df = kdf.to_spark()
# define the window
window = Window.orderBy('id').rowsBetween(-sys.maxsize, 0)
# define the forward-filled column
filled_column = last(spark_df['A'], ignorenulls=True).over(window)
# do the fill
spark_df_filled = spark_df.withColumn('A_filled', filled_column)
I have no answer for interpolate - still trying to find it myself.
PS - You can switch to a backfill by changing the window to rowsBetween(0, sys.maxsize) and using first() rather than last().
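A minimal sketch of that backfill variant, under the same setup as above:
from pyspark.sql.functions import first

back_window = Window.orderBy('id').rowsBetween(0, sys.maxsize)
filled_backward = first(spark_df['A'], ignorenulls=True).over(back_window)
spark_df_backfilled = spark_df.withColumn('A_backfilled', filled_backward)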

Remove all elements from an array python

1) The question I'm working on:
Write an algorithm for a function called removeAll which takes 3 parameters: an array of array type, a count of elements in the array, and a value. As with the remove method we discussed in class, elements past the count of elements are stored as None. This function should remove all occurrences of value and then shift the remaining data down. The last populated element in the array should then be set to None. The function then returns the count of "valid" (i.e. non-removed) data elements left. This function should do the removal "by hand" and SHOULD NOT use the remove method.
2) Below I have what I think works for the question, but it seems inefficient and repetitive. Is there any way to simplify it?
def myremove(mylist, elements, val):
    for i in range(elements):  # Change every occurrence of val to None
        if mylist[i] == val:
            mylist[i] = None
    for i in range(elements):
        if mylist[i] is None:  # Move the remaining elements left
            for j in range(i, elements - 1):
                mylist[j] = mylist[j + 1]
            mylist[elements - 1] = None
    while mylist[0] is None:  # If the first element is None, move left until it is not
        for j in range(i, elements - 1):
            mylist[j] = mylist[j + 1]
        mylist[elements - 1] = None
    for i in range(elements):  # Count the remaining elements
        if mylist[i] is None:
            elements -= 1
    return mylist, elements
"""
"""
# Testing the function
print(myremove([8, 'N', 24, 16, 1, 'N'], 6, 'N'))
print(myremove([1, 'V', 3, 4, 2, 'V'], 6, 3))
print(myremove([0, 'D', 5, 6, 9, 'D'], 6, 'N'))
print(myremove(['X', 'X', 7, 'X', 'C', 'X'], 6, 'X'))
"""
"""
OUTPUT
([8, 24, 16, 1, None, None], 4)
([1, 'V', 4, 2, 'V', None], 5)
([0, 'D', 5, 6, 9, 'D'], 6)
([7, 'C', None, None, None, None], 2)
"""
You can just sort the list based on whether or not each value equals the "hole" value. Since Python's sort is stable, the relative order of the remaining elements is preserved.
l = [8, 'N', 24, 16, 1, 'N']
sorted(l, key=lambda x: x == 'N')
output:
[8, 24, 16, 1, 'N', 'N']
If you need None instead of the hole value in the output, use a list comprehension to swap the value for None first, then sort on whether each element is None.
l = [i if i != 'N' else None for i in [8, 'N', 24, 16, 1, 'N']]
sorted(l, key=lambda x: x is None)
[8, 24, 16, 1, None, None]
Then all that's left is to add in the count, which you can get by counting how many elements are None and subtracting that from your input parameter.
def myremove(mylist, elements, val):
    ret_list = sorted([i if i != val else None for i in mylist], key=lambda x: x is None)
    return ret_list, elements - ret_list.count(None)
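For comparison, a single-pass, in-place sketch of the same assignment (my own version, not part of the answer above): walk the list once with a write index, copy every non-val element down, then pad the tail with None.
def remove_all(mylist, elements, val):
    write = 0
    for read in range(elements):
        if mylist[read] != val:
            mylist[write] = mylist[read]  # shift valid data down
            write += 1
    for i in range(write, elements):  # pad the tail with None
        mylist[i] = None
    return write

lst = [8, 'N', 24, 16, 1, 'N']
print(remove_all(lst, 6, 'N'), lst)  # 4 [8, 24, 16, 1, None, None]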

Initializing matrix python3

I don't know whether this is a bug, or whether I've misunderstood the semantics of the * operator on lists:
>>> arr = [None] * 5 # Initialize array of 5 'None' items
>>> arr
[None, None, None, None, None]
>>> arr[2] = "banana"
>>> arr
[None, None, 'banana', None, None]
>>> # right?
...
>>> mx = [ [None] * 3 ] * 2 # initialize a 3x2 matrix with 'None' items
>>> mx
[[None, None, None], [None, None, None]]
>>> # so far, so good, but then:
...
>>> mx[0][0] = "banana"
>>> mx
[['banana', None, None], ['banana', None, None]]
>>> # Huh?
Is this a bug, or did I get the semantics of the __mul__ operator wrong?
You're copying the same reference to the list multiple times. Do it like this:
matrix = [[None] * 3 for _ in range(2)]
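A quick check that the rows are now independent:
>>> matrix = [[None] * 3 for _ in range(2)]
>>> matrix[0][0] = "banana"
>>> matrix
[['banana', None, None], [None, None, None]]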
