How to put filenames into an array from os.listdir - python-3.x

Let's say you input: os.listdir(r'filepath')
and the output is: ['a.txt','b.txt','c.txt','d.txt','e.txt']
How could you put the file names, ['a', 'b', 'c', 'd', 'e'] into a pandas dataframe?

Use a list comprehension with the DataFrame constructor:
L = ['a.txt','b.txt','c.txt','d.txt','e.txt']
df = pd.DataFrame({'col':[x.split('.')[0] for x in L]})
print (df)
  col
0   a
1   b
2   c
3   d
4   e
Thanks to @Joe Halliwell for the suggestion; its main advantage is being a general solution, since os.path.splitext handles filenames that contain extra dots. Check this:
df = pd.DataFrame({'col': [os.path.splitext(x)[0] for x in L]})
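If you prefer pathlib (Python 3.4+), here is a minimal sketch of the same idea; the directory path is the question's placeholder:
from pathlib import Path

# Path.stem strips the final extension, e.g. 'a.txt' -> 'a'
df = pd.DataFrame({'col': [p.stem for p in Path(r'filepath').iterdir() if p.is_file()]})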

Related

Pandas Dataframe of Unique Triples

I'm currently working with some pandas dataframes in Python, and I'm not sure how this operation can be done. For example, I have an empty dataframe df and a list of the following triples:
L = [(1,2,3), (2,5,4), (2,5,4), (3,2,0), (2,1,3)]
I wish to add all these triples to the dataframe df with columns ['id', 'a', 'b', 'c'] according to some constraint. The id is simply a counter that tracks how many items have been added so far, and a, b, and c are columns for the triples (order among them does not matter, so permutations count as the same triple). The idea is to traverse all items in L linearly and add each one to df subject to the restriction:
It is ok to add (1,2,3) since df is still empty. (id=0)
It is ok to add (2,5,4) since it or any of its permutation has not appeared yet in df. (id=1)
We then see (2,5,4) but this already exists in df, hence we cannot add it.
Next is (3,2,0) and we can clearly add this for the same reason as #2. (id=2)
Finally, there is (2,1,3). While this triple does not yet exist in df, it is a permutation of an existing triple in df (namely (1,2,3)), so we cannot add it to df.
In the end, the final df should look something like this.
id  a  b  c
 0  1  2  3
 1  2  5  4
 2  3  2  0
Does anyone know how this can be done? My idea is to first curate an auxiliary list LL that would contain these "unique" triples and then transform it into a pandas df, but I'm not sure whether that is a fast and elegant approach.
Fast solution
Create a NumPy array from the list, sort the array along axis=1, and use duplicated to create a boolean mask identifying rows whose sorted values have already been seen; then remove the duplicate rows from the array and create a new dataframe:
import numpy as np
import pandas as pd

a = np.array(L)
# sort each row so that permutations compare equal, then flag repeats
m = pd.DataFrame(np.sort(a, axis=1)).duplicated()
pd.DataFrame(a[~m], columns=['a', 'b', 'c'])
Result
   a  b  c
0  1  2  3
1  2  5  4
2  3  2  0
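The question also asks for an id counter column; a sketch of one way to add it (not part of the original answer) is to insert the row position after deduplication:
df = pd.DataFrame(a[~m], columns=['a', 'b', 'c'])
df.insert(0, 'id', range(len(df)))  # id reflects the order in which unique triples were kept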
You can use a dictionary comprehension with a frozenset of the tuple as key to eliminate the duplicated permutations, then feed the values to the DataFrame constructor:
L = [(1,2,3), (2,5,4), (2,5,4), (3,2,0), (2,1,3)]
df = pd.DataFrame({frozenset(t): t for t in L[::-1]}.values(),
                  columns=['a', 'b', 'c'])
output:
   a  b  c
0  1  2  3
1  3  2  0
2  2  5  4
If order is important, you can use a set to collect the seen values instead:
seen = set()
df = pd.DataFrame([t for t in L if (f := frozenset(t)) not in seen
                   and not seen.add(f)],
                  columns=['a', 'b', 'c'])
output:
   a  b  c
0  1  2  3
1  2  5  4
2  3  2  0
Handling duplicate values in the tuple: a frozenset would collapse repeated values, so use a sorted tuple as the key instead:
df = pd.DataFrame({tuple(sorted(t)): t
                   for t in L[::-1]}.values(),
                  columns=['a', 'b', 'c'])
If there are many columns, sorting becomes inefficient; in that case you can use a Counter:
from collections import Counter

df = pd.DataFrame({frozenset(Counter(t).items()): t
                   for t in L[::-1]}.values(),
                  columns=['a', 'b', 'c'])
Pure pandas alternative:
You can do the same with pandas using loc and aggregating each row to a set:
df = pd.DataFrame(L).loc[lambda d: ~d.agg(set, axis=1).duplicated()]

keep rows based on the values of a given column in pandas

Given a data frame, I would like to keep the rows where a given column value does not match the given strings.
For instance, if the column 'En' does not match 'x1', I will keep those rows. I use the following code to do it.
df1 = df1.loc[df1['En'] != 'x1']
If, instead of x1 only, there are both x1 and x2 that need to be examined (in other words, I will only keep the rows whose 'En' column matches neither x1 nor x2), what's the most efficient way to do that?
This is how I did it:
df1 = df1.loc[df1['En'] != 'x1']
df1 = df1.loc[df1['En'] != 'x2']
Use the logical AND operator:
df1 = df1.loc[(df1['En'] != 'x1') & (df1['En'] != 'x2')]
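An alternative sketch (not from the original answers): Series.isin with negation, which scales better as the list of excluded values grows:
df1 = df1.loc[~df1['En'].isin(['x1', 'x2'])]  # keep rows where 'En' is neither value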
Are you looking for something like this?
import pandas as pd

df1 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'b': ['x', 'y', 'z', 'x', 'y', 'x', 'z'],
                    'c': [1, 2, 3, 4, 5, 6, 7]})
print(df1)
df2 = df1.loc[(df1['b'] != 'x') & (df1['b'] != 'y')]
print(df2)
If df1 is :
       a  b  c
0    one  x  1
1    one  y  2
2    two  z  3
3  three  x  4
4    two  y  5
5    one  x  6
6    six  z  7
then df2 will be:
     a  b  c
2  two  z  3
6  six  z  7
An alternative way to do this is using query:
df2 = df1.query("b != 'x' & b != 'y'")
OR
df2 = df1.query("b != ['x','y']")
This will also give you the same result.
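query also understands not in, so an equivalent spelling (assuming the same df1) is:
df2 = df1.query("b not in ['x', 'y']")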
For more information about using some of these operators, see https://pandas.pydata.org/pandas-docs/version/0.13.1/indexing.html

Pandas Dataframe, sum single value grouped by multiple columns

I have searched for this answer but cannot find something which will work. I want to sum a column keyword_visibility and group it by three columns category, trend_month, trend_year.
The result would be in the same dataframe and would be called sum_keyword_visibility_by_category.
What I have tried includes:
df_market_share['sum_keyword_visibility_by_category'] = df_market_share.groupby(['category', 'trend_month', 'trend_year'])['keyword_visibility'].sum()
and
df_market_share['sum_keyword_visibility_by_category'] = df_market_share["keyword_visibility"].groupby(df_market_share["category"], ["trend_month" ]).transform("sum")
The error I am getting for the first attempt is TypeError: incompatible index of inserted column with frame index, and for the second attempt, TypeError: unhashable type: 'list'. Any help is much appreciated.
That is because you are grouping values: you are trying to insert the result of a groupby and summation into the original row index of your data frame, which means you are trying to insert a smaller set of values into the new column.
Check this link:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
If you want to insert the results into your dataframe, you can locate the corresponding values and insert the results with df.loc.
If I am understanding the question correctly, you want to use transform. The following example groups by two columns, but it should be clear how to extend to three:
import pandas as pd

data = [
    ['A', 'C', 1],
    ['A', 'D', 2],
    ['A', 'C', 2],
    ['B', 'C', 3],
    ['B', 'D', 4],
    ['B', 'C', 4],
]
df = pd.DataFrame(data, columns=['col1', 'col2', 'col_to_sum'])
df['summed_col'] = df.groupby(['col1', 'col2']).col_to_sum.transform('sum')
df
Output:
  col1 col2  col_to_sum  summed_col
0    A    C           1           3
1    A    D           2           2
2    A    C           2           3
3    B    C           3           7
4    B    D           4           4
5    B    C           4           7
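Applied to the question's own column names, the three-column version is a direct extension (a sketch using the names from the question):
df_market_share['sum_keyword_visibility_by_category'] = (
    df_market_share
    .groupby(['category', 'trend_month', 'trend_year'])['keyword_visibility']
    .transform('sum')
)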

How can I select columns which are not consecutive in Python Pandas

How can I select columns which are not consecutive?
e.g.
index  a  b  c
1      2  3  4
2      3  4  5
How do I select 'a', 'c' and save it in to df1?
df1 = df.log[:, 'a''c'] but it doesn't work...
You can use
df1 = df[['a', 'c']]
to get the result.
First of all, I believe there is a typo. Your code should be
df1 = df.loc[:, ['a', 'c']]
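If the columns are easier to address by position than by name, an iloc sketch (my addition, not from the original answers) also works:
df1 = df.iloc[:, [0, 2]]  # positional selection: first and third columns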

Group by and Count Function returns NaNs [duplicate]

I am using .size() on a groupby result in order to count how many items are in each group.
I would like the result to be saved to a new column name without manually editing the column names array, how can it be done?
This is what I have tried:
grpd = df.groupby(['A','B'])
grpd['size'] = grpd.size()
grpd
and the error I got:
TypeError: 'DataFrameGroupBy' object does not support item assignment
(on the second line)
The .size() built-in method of DataFrameGroupBy objects actually returns a Series object with the group sizes and not a DataFrame. If you want a DataFrame whose column is the group sizes, indexed by the groups, with a custom name, you can use the .to_frame() method and use the desired column name as its argument.
grpd = df.groupby(['A','B']).size().to_frame('size')
If you wanted the groups to be columns again you could add a .reset_index() at the end.
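A minimal usage sketch of that pattern, with placeholder data:
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['a', 'a', 'b']})
sizes = df.groupby(['A', 'B']).size().to_frame('size').reset_index()
# sizes has columns A, B and size, one row per group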
You need transform with size (the length of df stays the same as before):
Notice:
Here it is necessary to select one column after the groupby, or else you get an error. Because GroupBy.size counts NaNs too, which column is used is not important; all columns work the same.
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})
print (df)
   A  B
0  x  a
1  x  c
2  x  c
3  y  b
4  y  b
df['size'] = df.groupby(['A', 'B'])['A'].transform('size')
print (df)
   A  B  size
0  x  a     1
1  x  c     2
2  x  c     2
3  y  b     2
4  y  b     2
If you need to set a column name in an aggregated df (the length of df is obviously NOT the same as before):
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})
print (df)
   A  B
0  x  a
1  x  c
2  x  c
3  y  b
4  y  b
df = df.groupby(['A', 'B']).size().reset_index(name='Size')
print (df)
   A  B  Size
0  x  a     1
1  x  c     2
2  y  b     2
The result of df.groupby(...) is not a DataFrame. To get a DataFrame back, you have to apply a function to each group, transform each element of a group, or filter the groups.
It seems like you want a DataFrame that contains (1) all your original data in df and (2) the count of how much data is in each group. These things have different lengths, so if they need to go into the same DataFrame, you'll need to list the size redundantly, i.e., for each row in each group.
import numpy as np
df['size'] = df.groupby(['A','B']).transform(np.size)
(Aside: It's helpful if you can show succinct sample input and expected results.)
You can set the as_index parameter in groupby to False to get a DataFrame instead of a Series:
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'], 'B': [1, 2, 2, 2]})
df.groupby(['A', 'B'], as_index=False).size()
Output:
   A  B  size
0  a  1     1
1  a  2     1
2  b  2     2
Let's say n is the name of the dataframe and cst is the column whose repeated items are being counted.
The code below gives the count in the next column:
from collections import Counter
import pandas as pd

cstn = Counter(n.cst)
cstlist = pd.DataFrame.from_dict(cstn, orient='index').reset_index()
cstlist.columns = ['name', 'cnt']
n['cnt'] = n['cst'].map(cstlist.set_index('name')['cnt'].to_dict())
Hope this will work.
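A shorter equivalent (my suggestion, not from the original answer) maps value_counts directly:
n['cnt'] = n['cst'].map(n['cst'].value_counts())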
