How to drop duplicates in Vaex? - python-3.x

I have some entries from users and how many interactions each user had on my website.
I have 340k rows and 70+ columns, and I want to use Vaex, but I'm having trouble doing simple things like dropping duplicates.
Could someone help me with how to do it?
import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Bob', 'Alice', 'Alice', 'Alice', 'Ralph', 'Ralph'],
                   'date': ['2013-12-05', '2014-02-05', '2013-11-07', '2014-04-22', '2014-04-30', '2014-04-20', '2014-05-29'],
                   'interaction_num': ['1', '2', '1', '2', '3', '1', '2']})
I want to get the same result as the pandas drop_duplicates(keep="last") function:
df.drop_duplicates('user', keep='last', inplace=True)
the expected result using Vaex should be:
user date interaction_num
1 Bob 2014-02-05 2
4 Alice 2014-04-30 3
6 Ralph 2014-05-29 2

It seems there is no built-in drop_duplicates in vaex yet, but we should expect this functionality at some point.
In the meantime, there is an attempt from the creator of vaex.

The code adapted from https://github.com/vaexio/vaex/pull/1623/files works for me:
import vaex

def drop_duplicates(df, columns=None):
    """Return a :class:`DataFrame` object with no duplicates in the given columns.

    .. warning:: The resulting dataframe will be in memory, use with caution.

    :param columns: Column or list of columns to remove duplicates by, defaults to all columns.
    :return: :class:`DataFrame` object with duplicates filtered away.
    """
    if columns is None:
        columns = df.get_column_names()
    if isinstance(columns, str):
        columns = [columns]
    return df.groupby(columns, agg={'__hidden_count': vaex.agg.count()}).drop('__hidden_count')
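A minimal usage sketch, assuming the pandas frame from the question is converted with vaex.from_pandas (the variable names below are illustrative). Note that the groupby result only keeps the columns you group by, so grouping on 'user' alone is not a drop-in replacement for pandas' keep='last':

import vaex
import pandas as pd

pdf = pd.DataFrame({'user': ['Bob', 'Bob', 'Alice'],
                    'date': ['2013-12-05', '2014-02-05', '2013-11-07'],
                    'interaction_num': ['1', '2', '1']})
vdf = vaex.from_pandas(pdf)

# drop exact duplicate rows, considering all columns
unique_rows = drop_duplicates(vdf)

# drop duplicates by 'user' only; the result contains just the 'user' column
unique_users = drop_duplicates(vdf, columns='user')
print(unique_users)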

Related

Extract value from list of dictionaries in dataframe using pandas

I have this dataframe with 4 columns. I want to extract resourceName (i.e. the IDs) into a separate column. I tried various methods and loops but was unable to separate it.
Dataset:
Username                    Event name      Resources
XYZ-DEV_ENV_POST_function   StopInstances   [{"resourceType":"AWS::EC2::Instance","resourceName":"i-05fbb7a"}]
XYZ-DEV_ENV_POST_function   StartInstances  [{"resourceType":"AWS::EC2::Instance","resourceName":"i-08bd2475"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0fd69dc1"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0174dd38aea"}]
I want one more column, IDS, which will have the IDs from the Resources column and will look like this:
Username                    Event name      Resources                                                             IDS
XYZ-DEV_ENV_POST_function   StopInstances   [{"resourceType":"AWS::EC2::Instance","resourceName":"i-05fbb7a"}]    i-05fbb7a
XYZ-DEV_ENV_POST_function   StartInstances  [{"resourceType":"AWS::EC2::Instance","resourceName":"i-08bd2475"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0fd69dc1"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0174dd38aea"}]    i-08bd2475, i-0fd69dc1, i-0174dd38aea
Here is the output of data.head(2).to_dict():
{'Date':
{0: '28-02-2022', 1: '28-02-2022'},
'Event name':
{0: 'StopInstances', 1: 'StartInstances'},
'Resources':
{
0: '[{"resourceType":"AWS::EC2::Instance","resourceName":"i-05fbb7a"}]',
1: '[{"resourceType":"AWS::EC2::Instance","resourceName":"i-08bd2475"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0fd69dc1"},{"resourceType":"AWS::EC2::Instance","resourceName":"i-0174dd38aea"}]'},
'User name': {0: 'XYZ-DEV_ENV_POST_function', 1:
'XYZ-DEV_ENV_POST_function'}}
Thanks and Regards
df['ID'] = df['Resources'].apply(lambda x: ','.join([i['resourceName'] for i in eval(x)]))
Date ... ID
0 28-02-2022 ... i-05fbb7a
1 28-02-2022 ... i-08bd2475,i-0fd69dc1,i-0174dd38aea
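Since the Resources cells are JSON strings, json.loads is arguably a safer choice than eval; a small variant of the same idea (not the original answer's code):

import json

df['ID'] = df['Resources'].apply(lambda x: ','.join(d['resourceName'] for d in json.loads(x)))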

create MultiIndex columns based on "lookup"

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but the underlying data in the resulting DataFrame is all NaN.
thanks in advance!
Two things to notice: you want set_axis rather than reindex (reindex aligns to the new labels, and since the MultiIndex tuples don't exist in the original flat column index, the data comes back as NaN), and sorting by the original column order ensures the correct label is assigned to the correct column (this is done in the sorted(..., key=...) bit).
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1 cities countries planets
level2 nyc london canada chile earth
0 0.028033 0.540977 -0.056096 1.675698 -0.328630
1 1.170465 -1.003825 0.882126 0.453294 -1.127752
2 -0.187466 -0.192546 0.269802 -1.225172 -0.548491
3 2.272900 -0.085427 0.029242 -2.258696 1.034485
4 -1.243871 -1.660432 -0.051674 2.098602 -2.098941
5 -0.820820 -0.289754 0.019348 0.176778 0.395959
6 1.346459 -0.260583 0.212008 -1.071501 0.945545
7 0.673351 1.133616 1.117379 -0.531403 1.467604
8 0.332187 -3.541103 -0.222365 1.035739 -0.485742
9 -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
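Note that set_axis returns a new DataFrame by default, so assign the result back; sorting the columns afterwards is optional if you want the top-level groups displayed contiguously (a small follow-up sketch, not part of the original answer):

df = df.set_axis(multi_index, axis=1)
df = df.sort_index(axis=1)  # optional: keeps 'cities', 'countries', 'planets' grouped together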

How to make a function with dynamic variables to add rows in pandas?

I'm trying to make a table from a list of data using pandas.
Originally I wanted to make a function to which I can pass dynamic variables, so I could continuously add new rows from the data list.
It works up until the point where the row-adding part begins. The column headers are added, but the data is not; it either keeps the value of only the last column or adds nothing.
My scratch attempt was:
for title in titles:
    for x in data:
        table = {
            title: data[x]
        }
df = pd.DataFrame(table, columns=titles, index=[0])
columns list:
titles = ['timestamp', 'source', 'tracepoint']
data list:
data = ['first', 'second', 'third',
'first', 'second', 'third',
'first', 'second', 'third']
How can I make something like this?
timestamp, source, tracepoint
first, second, third
first, second, third
first, second, third
If you just want to initialize a pandas DataFrame, you can use the DataFrame constructor.
You can also append a row using a dict.
Pandas provides other useful functions, such as concatenation between DataFrames and inserting/deleting columns. If you need them, please check the pandas docs.
import pandas as pd
# initialization by dataframe’s constructor
titles = ['timestamp', 'source', 'tracepoint']
data = [['first', 'second', 'third'],
        ['first', 'second', 'third'],
        ['first', 'second', 'third']]
df = pd.DataFrame(data, columns=titles)
print('---initialization---')
print(df)
# append row
new_row = {
    'timestamp': '2020/11/01',
    'source': 'xxx',
    'tracepoint': 'yyy'
}
df = df.append(new_row, ignore_index=True)
print('---append result---')
print(df)
output
---initialization---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
---append result---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
3 2020/11/01 xxx yyy
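One caveat: DataFrame.append was deprecated and removed in pandas 2.0, so on newer versions the same append can be done with pd.concat (a sketch, reusing the new_row dict above):

new_row_df = pd.DataFrame([new_row])
df = pd.concat([df, new_row_df], ignore_index=True)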

extracting data from numpy array in python3

I imported my CSV file into Python using numpy and the result looks like this:
>>> print(FH)
array([['Probe_Name', '', 'A2M', ..., 'POS_D', 'POS_E', 'POS_F'],
['Accession', '', 'NM_000014.4', ..., 'ERCC_00092.1',
'ERCC_00035.1', 'ERCC_00034.1'],
['Class_Name', '', 'Endogenous', ..., 'Positive', 'Positive',
'Positive'],
...,
['CF33294_10', '', '6351', ..., '1187', '226', '84'],
['CF33299_11', '', '5239', ..., '932', '138', '64'],
['CF33300_12', '', '37372', ..., '981', '202', '58']], dtype=object)
Every inner list is a column, and the first item of every column is the header. I want to plot the data in different ways. To do so, I want to make a variable for every single column. For example, for the first column (header 'Probe_Name'), print(Probe_Name) should show the results like this:
A2M
.
.
.
POS_D
POS_E
POS_F
and this is the case for the rest of the columns. Then I will plot the variables.
I tried to do this in Python 3 like this:
def items(N_array):
    for item in N_array:
        name = item[0]
        content = item[1:]
        return name, content

print(items(FH))

It does not return what I expect. Do you know how to fix it?
One simple way to do this is with pandas dataframes. When you read the csv file using a pandas dataframe, you essentially get a collection of 'columns' (called series in pandas).
import pandas as pd
df = pd.read_csv("your filename.csv")
df
Probe_Name Accession
0 A2m MD_9999
1 POS_D NM_0014.4
2 POS_E 99999
Now we can deal with each column, which is named automatically from the header row.
print(df['Probe_Name'])
0 A2m
1 POS_D
2 POS_E
Furthermore, you can do plotting (assuming you have numeric data in here somewhere).
http://pandas.pydata.org/pandas-docs/stable/index.html
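If you would rather keep the numpy object array you already have, a minimal sketch (an assumption-based illustration, not the answer above) is to build a dict keyed by the name in each row; the empty second cell is skipped here on the assumption that it is just a separator:

# build a {name: values} mapping from the object array FH shown in the question
columns = {row[0]: row[2:] for row in FH}
print(columns['Probe_Name'])   # values of the Probe_Name column, e.g. 'A2M', ..., 'POS_F'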

assigning a list of tokens as a row to dataframe

I am attempting to create a dataframe where the first column is a list of tokens and where additional columns of information can be added. However pandas will not allow a list of tokens to be added as one column.
So the code looks as below:
array1 = ['two', 'sample', 'statistical', 'inferences', 'includes']
array2 = ['references', 'please', 'see', 'next', 'page', 'the','material', 'of', 'these']
array3 = ['time', 'student', 'interest', 'and', 'lecturer', 'preference', 'other', 'topics']
## initialise list
list = []
list.append(array1)
list.append(array2)
list.append(array3)
## create dataFrame
numberOfRows = len(list)
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns = ('data', 'diversity'))
df.iloc[0] = list[0]
the error message reads
ValueError: cannot copy sequence with size 6 to array axis with dimension 2
Any insight into how I can better achieve creating a dataframe and updating columns would be appreciated.
Thanks
OK, so the answer was fairly simple; posting it for posterity.
When adding lists as rows, I needed to include the column name and position.
So the code looks like below:
df.data[0] = array1
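As an alternative sketch (not the poster's method), the frame can also be built in one go so that each 'data' cell already holds a token list, and a single cell can be set with .at; the 'diversity' column is left as a placeholder here:

import pandas as pd

token_lists = [array1, array2, array3]
df = pd.DataFrame({'data': token_lists, 'diversity': [None] * len(token_lists)})

# .at sets one cell without the length-mismatch error that row-wise assignment raised
df.at[0, 'data'] = array1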
