I get data from stdin that looks like:
x
x
x y
x
x y z
...
and I want to create a pandas DataFrame from that input with
df = pd.read_csv(sys.stdin, sep='\t', header=None)
The problem is that the third line of my data has more values than the first one, so I get
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
The question is: how can I handle this error when I don't know in advance the longest chain of elements (separated by \t)?
The whole task can be performed in a single instruction,
without counting the elements in each row.
I prepared such an example reading from a string, using io.StringIO:
import io
import pandas as pd
df = pd.DataFrame([ln.rstrip().split('\t') for ln in
                   io.StringIO(txt).readlines()]).fillna('')
The list comprehension converts each source line into a list
of fragments (the substrings between tabs).
This list is then passed as the data parameter to pd.DataFrame;
note that the rows of such a list may have different lengths.
I also added fillna('') to convert each NaN into an empty string
(you are free to delete it if you wish).
To run the test, I defined the source variable as:
txt = '''x
x
x y
x
x y z
x
x y z v'''
and executed the above code, getting:
   0  1  2  3
0  x
1  x
2  x  y
3  x
4  x  y  z
5  x
6  x  y  z  v
In the target version, replace reading from a string with reading from
stdin.
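A minimal sketch of the stdin variant (same idea, just swapping the source; run it as e.g. cat data.tsv | python script.py):
import sys
import pandas as pd

# split each stdin line on tabs; rows may have different lengths
df = pd.DataFrame([ln.rstrip('\n').split('\t') for ln in sys.stdin]).fillna('')
print(df)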
I am not familiar with the pandas and openpyxl packages in Python, but I was wondering if it is possible to select random, filtered rows from an Excel file and save the output in a separate Excel file?
For example, the data below represents what is in my Excel file, and what I would like to do is:
A B C D E F G
1 x -8 x x x 10/12/2019
3 x +9 x x x 28/12/2019
4 x +2 x x x 12/12/2019
1 x -2 x x x 15/12/2019
4 x -9 x x x 19/12/2019
5 x -1 x x x 25/12/2019
Filter to only negative values in column C ("-ve")
Filter to only items (column G) dated from 15/12/2019 onwards
Randomly select samples (for example, 2) from this filtered list
Save the 2 randomly selected samples in a new Excel file, with the headers (A, B, C, D, E, F, G)
Is there any way I could do this?
I will potentially be using the script well over 20 times.
Thank you!
You can use:
from numpy.random import choice
import pandas as pd

# df holds the source data, e.g. df = pd.read_excel('source.xlsx')
# ensure that the date-like column is of type datetime (dates are day-first)
df['G'] = pd.to_datetime(df['G'], dayfirst=True)
# check your 1st and 2nd conditions: negative C, G from 15/12/2019 onwards
out = df.loc[df['C'].lt(0) & df['G'].ge(pd.to_datetime('15/12/2019', dayfirst=True))]
# get 2 random rows (without replacement) from the filtered rows and save them to Excel
out.loc[choice(out.index, 2, replace=False)].to_excel('filename.xlsx', index=False)
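Since you mention running this well over 20 times, here is a sketch of the same steps wrapped in a reusable function (the file names and the sample size are assumptions, not part of your question):
import pandas as pd

def sample_filtered(src, dst, n=2, cutoff='15/12/2019'):
    # read the source workbook (reading .xlsx needs openpyxl installed)
    df = pd.read_excel(src)
    df['G'] = pd.to_datetime(df['G'], dayfirst=True)
    out = df.loc[df['C'].lt(0) & df['G'].ge(pd.to_datetime(cutoff, dayfirst=True))]
    # DataFrame.sample draws without replacement by default
    out.sample(n).to_excel(dst, index=False)

sample_filtered('source.xlsx', 'samples.xlsx')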
I have a data frame with 3 columns: Test, X and Y. I want to add a new column Xmean which contains the mean value of X, subject to a condition on Y, for each Test.
For example, Xmean contains the mean of X over the rows where Y >= 5, for each Test.
import pandas as pd

df = pd.read_csv(r'Downloads\test.txt', delimiter=',', encoding='utf-8')
df_sort = df.sort_values('test')
# keep only the rows that satisfy the condition on y
df_filter = df_sort[df_sort['y'] >= 5]
# apply the aggregate function to find the mean per test
df_agg = df_filter.groupby('test')['x'].mean().rename('xmean').reset_index()
# join the two dataframes to get the desired output (left join keeps tests with no qualifying rows)
df_final = pd.merge(df_sort[['test', 'x', 'y']], df_agg, on='test', how='left')
print(df_final)
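As a side note, a more compact sketch of the same conditional mean using where plus a grouped transform (column names assumed to match the code above):
import pandas as pd

df = pd.read_csv(r'Downloads\test.txt', delimiter=',', encoding='utf-8')
# mask x where the condition fails, then broadcast the per-test mean back to every row
df['xmean'] = df['x'].where(df['y'] >= 5).groupby(df['test']).transform('mean')
print(df)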
I am trying to count unique values that appear in several columns. My data frame looks like this:
Name Name.1 Name.2 Name.3
x z c y
y p q x
q p a y
The output should look like this:
x 2
z 1
c 1
y 3
q 2
p 2
a 1
I tried groupby and value_counts but couldn't get the correct output. Any ideas? Thanks all!
It seems you want to consider values regardless of their row or column location. In that case you can collapse the dataframe and just use Counter:
from collections import Counter
import numpy as np

arr = np.array(df)            # collapse the dataframe to a 2-D array
count = Counter(arr.ravel())  # flatten it and count every value
Another (pandas-based) approach is to apply pd.Series.value_counts to each column and then sum across the columns:
df2 = df.apply(pd.Series.value_counts)
print(df2.sum(axis=1).astype(int))
a 1
c 1
p 2
q 2
x 2
y 3
z 1
dtype: int32
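For completeness, a one-liner sketch that stacks the frame into a single Series first; the result is the same tally, just ordered by frequency:
import pandas as pd

# stack() collapses all columns into one Series, then value_counts() tallies it
print(df.stack().value_counts())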
How would I add a variable numeric prefix to DataFrame column names?
If I have a DataFrame df:
colA colB
0 A X
1 B Y
2 C Z
How would I rename the columns according to each column's position? Something like this:
1_colA 2_colB
0 A X
1 B Y
2 C Z
The actual number of columns is too large for them to be renamed manually.
Thanks for the help!
Use enumerate for the count, with an f-string inside a list comprehension:
#python 3.6+
df.columns = [f'{i}_{x}' for i, x in enumerate(df.columns, 1)]
#python below 3.6
#df.columns = ['{}_{}'.format(i, x) for i, x in enumerate(df.columns, 1)]
print (df)
1_colA 2_colB
0 A X
1 B Y
2 C Z
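If you prefer not to overwrite df.columns in place, a sketch of the same idea via rename, which returns a new frame (handy in method chains); the sample frame mirrors the one above:
import pandas as pd

df = pd.DataFrame({'colA': list('ABC'), 'colB': list('XYZ')})
df = df.rename(columns={x: f'{i}_{x}' for i, x in enumerate(df.columns, 1)})
print(df)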
I have posted a similar thread but now have another angle to explore: after doing a covariance analysis between X and Z, grouped by 2 different levels, I get a DataFrame like
index X Z
(1,1,'X') 2.3 0
...
'1' and '1' are the 2 different levels (I could have chosen '1' and '2'; there are 5 and 10 different levels).
Now I would like to extract each 'element' of the index and have something like:
index X Z H1 H2 H3
(1,1,'X') 2.3 0 1 1 X
...
I read a few posts on slicing and dicing, but this is not a normal string, is it?
Cheers
(1,1,'X') isn't a string here, it's a tuple.
So you need to split the tuple into multiple columns. You can achieve this
by applying pd.Series to the index column.
Say your dataframe is df in this case:
In [10]: df['index'].apply(pd.Series)
Out[10]:
   0  1  2
0  1  1  X
You need to add the columns back to the original data frame, so:
df[['H1', 'H2', 'H3']] = df['index'].apply(pd.Series)
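A minimal end-to-end sketch, assuming the tuples live in a regular column named 'index' (the one-row frame below is just a stand-in for your covariance output); building a DataFrame from the tuples via tolist() does the same job as apply(pd.Series) and is typically faster:
import pandas as pd

df = pd.DataFrame({'index': [(1, 1, 'X')], 'X': [2.3], 'Z': [0]})
# expand each tuple into its own set of columns
df[['H1', 'H2', 'H3']] = pd.DataFrame(df['index'].tolist(), index=df.index)
print(df)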