How can I convert Rpy2 data frame to python list or tuple? - rpy2

I have an rpy2 data frame of type <class 'rpy2.robjects.vectors.DataFrame'>. How can I convert it to a Python list or tuple, with every row as an element? Thanks!

I figured it out. I hope this helps if you are looking for an answer:
output = [tuple([df[j][i] for j in range(df.ncol)]) for i in range(df.nrow)]

I stumbled recently over one potential problem. Given a data frame from R:
| | a | c | b | d |
|---|-------|---|---|-----|
| 1 | info1 | 2 | 1 | op1 |
| 2 | info2 | 3 | 2 | 3 |
| 3 | info3 | 4 | 3 | 3 |
| 4 | info4 | 5 | 4 | 3 |
| 5 | info5 | 6 | 5 | 3 |
| 6 | info6 | 7 | 6 | 3 |
| 7 | 9 | 8 | 7 | 3 |
(Yes, I know: mixing data types such as str and float in one column may not be realistic, but the same holds for columns that contain only factors.)
With the conversion above, columns a and d show the integer codes of the factor levels, not the label values that are usually intended. The reason is stated in the rpy2 manual:
R’s factors are somewhat peculiar: they aim at representing a memory-efficient vector of labels, and in order to achieve it are implemented as vectors of integers to which are associated a (presumably shorter) vector of labels. Each integer represents the position of the label in the associated vector of labels.
The following rough draft code is a step towards handling this case:
from rpy2 import robjects

colnames = list(dataframe.colnames)
rownames = list(dataframe.rownames)
col2data = []
for cn, col in dataframe.items():
    if isinstance(col, robjects.vectors.FactorVector):
        # Factor codes are 1-based positions into the vector of levels.
        colevel = tuple(col.levels)
        ncol = [colevel[i - 1] for i in col]
    else:
        ncol = tuple(col)
    col2data.append((cn, ncol))
col2data.append(('rownames', rownames))
col2data = dict(col2data)
The output is a dict that maps column names to values. Looping over it and transposing the list of lists generates the row-wise output that was asked for.
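The transposing step can be sketched in plain Python as follows. This is only an illustration: the small col2data dict below is a hypothetical stand-in for the one built from the R data frame, so the snippet runs without an R session.

```python
# Hypothetical stand-in for the col2data dict built above (no R session needed).
col2data = {
    "a": ["info1", "info2", "info3"],
    "c": [2, 3, 4],
    "rownames": ["1", "2", "3"],
}

# Leave the row names out of the row tuples; keep the remaining column order.
columns = [name for name in col2data if name != "rownames"]

# zip(*...) transposes the list of columns into a sequence of row tuples.
rows = list(zip(*(col2data[name] for name in columns)))

print(rows)
```

Each element of rows is then one data frame row as a tuple, which is the shape the original question asked for.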

Related

Assign a string in a column if the corresponding row of another column contains a certain substring, else another string

I have a pandas dataframe like this
index | Creative Size | Business Model
1 | Something trueview |
2 | truviewhello |
3 | dunno |
4 | str |
5 | str |
I want to write code that assigns the label 'CPV' to the corresponding row in Business Model if 'trueview' occurs in Creative Size, and 'CPM' otherwise.
Expected output is:
index | Creative Size | Business Model
1 | Something trueview | CPV
2 | truviewhello | CPV
3 | dunno | CPM
4 | str | CPM
5 | str | CPM
I came up with this:
count = 0
for i in db_all['Creative Size']:
    if 'trueview' in i:
        db_all.loc[count, 'Business Model'] = 'CPV'
    else:
        db_all.loc[count, 'Business Model'] = 'CPM'
    count = count + 1
It works, but it's slow. Any better ideas?
Use numpy.where with Series.str.contains:
import numpy as np

db_all['Business Model'] = np.where(db_all['Creative Size'].str.contains('trueview'),
                                    'CPV',
                                    'CPM')
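A minimal self-contained sketch of this approach is shown below. The sample values are made up for illustration, and the na=False and regex=False arguments are additions that are not in the answer: they treat missing values as "no match" and search for the literal substring.

```python
import numpy as np
import pandas as pd

# Made-up sample data; the real db_all comes from the question.
db_all = pd.DataFrame(
    {"Creative Size": ["Something trueview", "trueviewhello", "dunno", None]}
)

# Boolean mask: True where the literal substring occurs, False for missing values.
has_trueview = db_all["Creative Size"].str.contains("trueview", regex=False, na=False)

# Vectorized if/else over the whole column instead of a Python loop.
db_all["Business Model"] = np.where(has_trueview, "CPV", "CPM")

print(db_all)
```

Because the mask is computed for the whole column at once, this avoids the per-row .loc assignment that made the loop slow.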

How to plot values associated to a string array in a pandas df?

I think my question is easy to solve.
I have a simple dataframe with this shape:
+------------+-----------+----------+
| Age_Group | Gene_Name | Degree |
+------------+-----------+----------+
| pediatric | JAK2 | 17 |
| adult | JAK2 | 14 |
| AYA | JAK2 | 11 |
| pediatric | ETV6 | 52 |
| adult | ETV6 | 7 |
| AYA | ETV6 | 4 |
The same pattern then repeats for other genes.
My goal is to plot the degree values on the y-axis, colored by age group, with the gene names on the x-axis, but I have no idea how to make the gene names suitable for a Python plotting function.
You can pivot the data frame and plot. If you want to rename gene names, that can be done beforehand using replace or map.
df.pivot(index = 'Gene_Name', columns = 'Age_Group',values = 'Degree').plot.bar()
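To see what the pivot produces before plotting, here is a small sketch built from the six rows shown in the question. The variable name wide is an assumption for illustration, and the plotting call is left as a comment because it needs matplotlib installed.

```python
import pandas as pd

# The six rows shown in the question.
df = pd.DataFrame(
    {
        "Age_Group": ["pediatric", "adult", "AYA", "pediatric", "adult", "AYA"],
        "Gene_Name": ["JAK2", "JAK2", "JAK2", "ETV6", "ETV6", "ETV6"],
        "Degree": [17, 14, 11, 52, 7, 4],
    }
)

# One row per gene, one column per age group, cells hold the degree.
wide = df.pivot(index="Gene_Name", columns="Age_Group", values="Degree")
print(wide)

# With matplotlib available, wide.plot.bar() draws one bar group per gene
# and one color per age group.
```

The gene names end up in the index, which is why the bar plot can use them directly as x-axis labels without any conversion.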

Python selecting different number of rows for each group of a mutlilevel index

I have a data frame with a multilevel index. I would like to sort this data frame based on a specific column and extract the first n rows for each group of the first index, but n is different for each group.
For example:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| | 10 | 1 | 2 |
| 2 | 20 | 2 | 1 |
| | 50 | 1 | 1 |
the result should look like this:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| 2 | 20 | 2 | 1 |
I got this far:
df.groupby(level=[0, 1]).sum().sort_values(['Index1', 'Sort_In_descending_order'], ascending=False).groupby('Index1').head(2)
However, .head(2) picks 2 elements from each group regardless of the number in the column "How_manyRows_toChoose".
A piece of code would be great!
Thank you!
Use a lambda function in GroupBy.apply with head, and pass group_keys=False to avoid duplicated index values:
# original code
df = (df.groupby(level=[0, 1])
        .sum()
        .sort_values(['Index1', 'Sort_In_descending_order'], ascending=False))
df = (df.groupby('Index1', group_keys=False)
        .apply(lambda x: x.head(x['How_manyRows_toChoose'].iat[0])))
print (df)
               Sort_In_descending_order  How_manyRows_toChoose
Index1 Index2
1      20                             3                      2
       40                             2                      2
2      20                             2                      1
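For anyone who wants to try this end to end, the sketch below rebuilds the example frame from the question and applies the same GroupBy.apply idea. The frame construction and the variable names ordered and result are assumptions added for illustration, and the intermediate .sum() step is skipped because the sample data has no duplicate index pairs.

```python
import pandas as pd

# Example data from the question, with Index1/Index2 as a MultiIndex.
df = pd.DataFrame(
    {
        "Index1": [1, 1, 1, 2, 2],
        "Index2": [20, 40, 10, 20, 50],
        "Sort_In_descending_order": [3, 2, 1, 2, 1],
        "How_manyRows_toChoose": [2, 2, 2, 1, 1],
    }
).set_index(["Index1", "Index2"])

# Sort so that the largest values come first inside every group.
ordered = df.sort_values("Sort_In_descending_order", ascending=False)

# For each Index1 group, keep as many rows as that group's own counter says.
result = ordered.groupby(level="Index1", group_keys=False).apply(
    lambda g: g.head(g["How_manyRows_toChoose"].iat[0])
)

print(result)
```

The groups are concatenated in key order, so Index1 = 1 comes before Index1 = 2 even though the frame was sorted in descending order first.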

Excel: How to extract a substring by only knowing the strings location?

So basically, I am trying to attempt what the title says.
This may be easier to explain with an example:
I have a string in a cell, such as spreadsheet:
   |      A      |   B    |   C
______________________________________
1  | spreadsheet | Number | Letter
2  |             | 1      | s
3  |             | 2      | p
4  |             | 3      | r
5  |             | 4      | e
6  |             | 5      | a
7  |             | 6      | d
8  |             | 7      | s
9  |             | 8      | h
10 |             | 9      | e
11 |             | 10     | e
12 |             | 11     | t
Since the string spreadsheet has 11 characters, the numbers from 1 to 11 are listed going down column B. The character at each of those positions is then placed in the column next to it.
I searched everywhere for a function and only found MID and FIND, but both seem to require knowing exactly which string you are searching for.
Does anyone have any idea on how to accomplish this?
Try this formula:
=IF(LEN($A$1)>=B2,MID($A$1,B2,1),"")
You could even get rid of column B by using the ROW function:
=IF(LEN($A$1)>=(ROW(C2)-ROW($C$1)),MID($A$1,ROW(C2)-ROW($C$1),1),"")
The formula =MID($A$1,B2,1) filled down in your C column will give you the result.

Excel Creating a List from Beginning and End number AND tags

I am trying to create a list from an index of grouped values.
This is very similar to this; however, my groups also have "tags" on them that complicate the listings.
Here is an example of my INDEX tab:
| A | B | C | D |
-------------------------
1 | 1 | 1 | 1 | CV |
2 | 1 | 2 | 2 | IS |
3 | 1 | 3 | 3 | IS |
4 | 2 | 4 | 5 | GN |
5 | 2 | 6 | 7 | PS |
6 | 4 | 8 | 11 | SQ |
7 | 2 | 12 | 13 | SS |
8 | 1 | 14 | 14 | AT |
9 | 15 | 15 | 29 | AT |
10| 4 | 30 | 33 | TYP |
Where A is the number of pages, B is the first page, C is the last page and D is the tag. I would also like to add columns such that I can keep a running tally of the tags.
| A | B | C | D | E | F |
---------------------------------------
1 | 1 | 1 | 1 | CV | CV1 | CV1 |
2 | 1 | 2 | 2 | IS | IS1 | IS1 |
3 | 1 | 3 | 3 | IS | IS2 | IS2 |
4 | 2 | 4 | 5 | GN | GN1 | GN2 |
5 | 2 | 6 | 7 | PS | PS1 | PS2 |
6 | 4 | 8 | 11 | SQ | SQ1 | SQ4 |
7 | 2 | 12 | 13 | SS | SS1 | SS2 |
8 | 1 | 14 | 14 | AT | AT1 | AT1 |
9 | 15 | 15 | 29 | AT | AT2 | AT16 |
10| 4 | 30 | 33 | TYP | TYP1 | TYP4 |
Note that the tag could occur multiple times and it may not be in sequential rows.
Here is what I want this to look like for my LIST tab:
| A |
---------
1 | CV1 |
2 | IS1 |
3 | IS2 |
4 | GN1 |
5 | GN2 |
6 | PS1 |
7 | PS2 |
8 | SQ1 |
9 | SQ2 |
10| SQ3 |
11| SQ4 |
and so on...
How do I add the additional columns to the INDEX tab via formulas?
How do I create the LIST via formulas? (...is this even possible?)
The formulas should be pretty simple to write. Just consider what you're trying to accomplish.
Your first formula (in column E) is just taking a running count of the tags (in column D). So you want to count all cells from the first tag up to the corresponding tag where the tag names are the same. That count is to be appended to the tag name.
=$D1 & COUNTIF($D$1:$D1, $D1)
The second formula (in column F) is just taking a running sum of the page counts (in column A). So you want to take the sum of all corresponding page counts from the first tag up to the corresponding tag where the tag names are the same. The sum is to be appended to the tag name.
=$D1 & SUMIF($D$1:$D1, $D1, $A$1:$A1)
Note that the columns don't change, nor do the starting rows of the ranges (hence the need for absolute references). The only things that change are the row of the tag and the end row of each range.
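As a cross-check of what the two formulas compute, the running count and running sum can be sketched in Python. The list index_rows is simply columns A and D of the INDEX tab typed in by hand, and the names col_e and col_f are assumptions for illustration.

```python
from collections import defaultdict

# (pages, tag) pairs copied from columns A and D of the INDEX tab.
index_rows = [
    (1, "CV"), (1, "IS"), (1, "IS"), (2, "GN"), (2, "PS"),
    (4, "SQ"), (2, "SS"), (1, "AT"), (15, "AT"), (4, "TYP"),
]

seen = defaultdict(int)   # running count per tag, like COUNTIF($D$1:$D1, $D1)
pages = defaultdict(int)  # running page sum per tag, like SUMIF($D$1:$D1, $D1, $A$1:$A1)
col_e, col_f = [], []
for page_count, tag in index_rows:
    seen[tag] += 1
    pages[tag] += page_count
    col_e.append(f"{tag}{seen[tag]}")
    col_f.append(f"{tag}{pages[tag]}")

print(col_e)
print(col_f)
```

The two lists reproduce columns E and F of the expected INDEX tab, including AT2 and AT16 for the second AT row.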
I don't think it would be possible to generate that list through simple formulas. As far as I know, formulas need a 1-to-1 correspondence with another range. Here a single row can yield multiple values, so a formula just won't cut it. You'll need to write a VBA script to generate that.
Sub GenerateList()
    Dim usedRange As Range
    Dim count As Dictionary
    Set usedRange = Worksheets("Index").usedRange
    Set count = CountValues(usedRange)

    Dim output As Range
    Dim row As Integer
    Dim key As Variant
    Set output = Worksheets("List").Columns("A").Rows
    output.ClearContents
    row = 1
    For Each key In count.Keys()
        Dim i As Integer
        For i = 1 To count(key)
            output(row) = key & i
            row = row + 1
        Next i
    Next key
End Sub

Function CountValues( _
    usedRange As Range, _
    Optional tagsColumn As String = "D", _
    Optional valuesColumn As String = "A") As Dictionary
    Dim tags As Range
    Dim values As Range
    Set tags = usedRange.Columns(tagsColumn).Rows
    Set values = usedRange.Columns(valuesColumn).Rows

    Dim map As New Dictionary
    Dim tag As Range
    For Each tag In tags
        map(tag.Value) = map(tag.Value) + values(tag.row)
    Next tag
    Set CountValues = map
End Function
This uses a Dictionary, so you'll have to add a reference to the Microsoft Scripting Runtime.
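The logic of the macro (total the pages per tag, then emit tag1 through tagN) can also be sketched in Python to check the expected LIST output. This is not a replacement for the VBA: index_rows is columns A and D of the INDEX tab typed in by hand, and totals and list_tab are illustrative names.

```python
# (pages, tag) pairs copied from columns A and D of the INDEX tab.
index_rows = [
    (1, "CV"), (1, "IS"), (1, "IS"), (2, "GN"), (2, "PS"),
    (4, "SQ"), (2, "SS"), (1, "AT"), (15, "AT"), (4, "TYP"),
]

# Total pages per tag; dicts keep insertion order, so tags stay in first-seen order.
totals = {}
for page_count, tag in index_rows:
    totals[tag] = totals.get(tag, 0) + page_count

# Expand each tag into tag1, tag2, ..., tagN.
list_tab = [f"{tag}{i}" for tag, total in totals.items() for i in range(1, total + 1)]

print(list_tab[:11])
```

The first eleven entries match the LIST tab shown in the question, and the full list has one entry per page.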
It sounds like you're just trying to get a list of "Unique Values" on a separate sheet that you can use as your list. Try these pages, there are multiple VBA methods to paste unique items in a range.
Also, Advanced Filter has an option to paste unique values to another location. So none of your repeat tags would appear in this list, only unique ones for your "LIST" tab.
Anyway, not sure if that's what you're wanting, but the question was a smidge vague.
Links here:
Create Unique list
Create Unique list 2
