pandas misreads lines in file [duplicate] - python-3.x

This question already has answers here:
Trailing delimiter confuses pandas read_csv
(3 answers)
Closed 4 years ago.
I'm trying to read the following file with pandas using python 3.6:
$ cat tmp2.txt
somename nan 0 0 1 0 0 1 11 0.909091 0 0 1 0 0 7 1 1 0 0 0 0 2
somename nan 0 0 1 0 0 1 36 0.972222 0 0 7 0 5 22 0 6 1 0 0 0 2
somename UgzVrvH-ahjgfT9-NfN4AaABAg.8e3_FgQnopN8e4FLHwai7v0 0 1 0 0 0 25 0.920000 0 0 0 0 2 22 0 1 0 0 0 0 0
somename UgxyXxibolL_qOhMsyZ4AaABAg.8eApKy29u5J8eAxINbTH2m0 0 1 0 0 0 13 1.000000 0 0 0 0 1 10 0 2 0 0 0 0 0
somename nan 0 0 0 0 0 2 56 0.839286 0 0 0 0 11 14 5 7 3 0 3 1 10
When I try reading it with pandas:
>>> import pandas as pd
>>> df = pd.read_csv(header=None, filepath_or_buffer="tmp2.txt", delim_whitespace=True, index_col=0)
>>> df.values[2,:]
array(['UgzVrvH-ahjgfT9-NfN4AaABAg.8e3_FgQnopN8e4FLHwai7v0', 0, 1, 0, 0,
0, 25, 0.92, 0.0, 0, 0, 0, 2, 22, 0, 1, 0, 0, 0, 0, 0, nan],
dtype=object)
>>> df.values[3,:]
array(['UgxyXxibolL_qOhMsyZ4AaABAg.8eApKy29u5J8eAxINbTH2m0', 0, 1, 0, 0,
0, 13, 1.0, 0.0, 0, 0, 0, 1, 10, 0, 2, 0, 0, 0, 0, 0, nan],
dtype=object)
>>> df.values[4,:]
array([nan, 0, 0, 0, 0, 0, 2, 56.0, 0.8392860000000001, 0, 0, 0, 0, 11,
14, 5, 7, 3, 0, 3, 1, 10.0], dtype=object)
As can be seen, when I print df.values[2,:] and df.values[3,:] I get an extraneous nan at the end. It seems like this might be an issue with a maximum number of characters per line, but the documentation for pandas.read_csv makes no mention of such a limit.
QUESTION: What causes this, and how can I get pandas.read_csv to read this file correctly?

It's similar to this: python pandas - trailing delimiter confuses read_csv
Your input data has trailing delimiters on some or all of the lines. Two easy fixes are to set usecols in read_csv(), or, after reading, to drop the resulting all-NaN last column:
if df[df.columns[-1]].isnull().all():
    df.drop(df.columns[-1], axis=1, inplace=True)
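For the usecols route, a minimal sketch; note that the column count n below is an assumption read off the sample rows and should be set to the real number of fields per line:
import pandas as pd

n = 22  # assumed number of real columns per line; adjust to your file
df = pd.read_csv("tmp2.txt", header=None, delim_whitespace=True,
                 index_col=0, usecols=range(n))
With usecols, the phantom column created by the trailing delimiter is never read in the first place.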

Related

Using Pandas to assign specific values

I have the following dataframe:
data = {'id': [1, 2, 3, 4, 5, 6, 7, 8],
'stat': ['ordered', 'unconfirmed', 'ordered', 'unknwon', 'ordered', 'unconfirmed', 'ordered', 'back'],
'date': ['2021', '2022', '2023', '2024', '2025','2026','2027', '1990']
}
df = pd.DataFrame(data)
df
I am trying to get the following data frame:
Unfortunately I have not been successful so far; I used the following commands (for loops) for stat == 'ordered' only:
y0 = np.zeros((len(df), 8), dtype=int)
y1 = [1990]
if stat == 'ordered':
    for i in df['id']:
        for j in y1:
            if df.loc[i].at['date'] in y1:
                y0[i][y1.index(j)] = 1
            else:
                y0[i][y1.index(j)] = 0
But unfortunately it did not return the expected solution, and besides that it takes a very long time to compute. I tried to use groupby, since it is faster than for loops, but I could not figure out how to use it properly either. Any idea would be much appreciated.
IIUC:
df.join(
    pd.get_dummies(df.date).cumsum(axis=1).mul(
        [1, 2, 1, 3, 1, 2, 1, 0], axis=0
    ).astype(int)
)
id stat date 1990 2021 2022 2023 2024 2025 2026 2027
0 1 ordered 2021 0 1 1 1 1 1 1 1
1 2 unconfirmed 2022 0 0 2 2 2 2 2 2
2 3 ordered 2023 0 0 0 1 1 1 1 1
3 4 unknwon 2024 0 0 0 0 3 3 3 3
4 5 ordered 2025 0 0 0 0 0 1 1 1
5 6 unconfirmed 2026 0 0 0 0 0 0 2 2
6 7 ordered 2027 0 0 0 0 0 0 0 1
7 8 back 1990 0 0 0 0 0 0 0 0
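Here pd.get_dummies one-hot encodes each row's date, cumsum(axis=1) turns that into a step that stays 1 from the row's own date onward, and mul scales each row by a per-status code. The hard-coded list [1, 2, 1, 3, 1, 2, 1, 0] assumes the exact row order above; a sketch that derives the codes from stat instead (the status-to-code mapping is itself an assumption read off the example output):
# Hypothetical mapping inferred from the expected output; adjust as needed.
codes = df['stat'].map({'ordered': 1, 'unconfirmed': 2, 'unknwon': 3, 'back': 0})
df.join(
    pd.get_dummies(df.date, dtype=int)  # dtype=int keeps 0/1 instead of bool
      .cumsum(axis=1)
      .mul(codes, axis=0)
)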

Diagonal Dataframe to 1 row

I need to convert a diagonal DataFrame to a one-row DataFrame.
Input:
df = pd.DataFrame([[7, 0, 0, 0],
                   [0, 2, 0, 0],
                   [0, 0, 3, 0],
                   [0, 0, 0, 8]],
                  columns=list('ABCD'))
A B C D
0 7 0 0 0
1 0 2 0 0
2 0 0 3 0
3 0 0 0 8
Expected output:
A B C D
0 7 2 3 8
What I tried so far to do this:
df1 = df.sum().to_frame().transpose()
df1
A B C D
0 7 2 3 8
It does the job. But is there any elegant way to do this by groupby or some other pandas builtin?
I'm not sure if there is any other 'elegant' way; I can only propose alternatives:
Use numpy.diagonal
pd.DataFrame([df.to_numpy().diagonal()], columns=df.columns)
A B C D
0 7 2 3 8
Use groupby with boolean (not sure if this is better than your solution):
df.groupby([True] * len(df), as_index=False).sum()
A B C D
0 7 2 3 8
You can use np.diagonal(df):
import numpy as np

pd.DataFrame(np.diagonal(df), df.columns).T
A B C D
0 7 2 3 8
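One caveat: the sum() approach relies on every off-diagonal entry being zero, while the diagonal-based versions work for any square frame. A quick check, with a hypothetical non-zero value placed off the diagonal:
df2 = df.copy()
df2.iloc[0, 1] = 99  # non-zero off-diagonal entry

df2.sum().to_frame().transpose()  # B becomes 101, no longer the diagonal
pd.DataFrame([df2.to_numpy().diagonal()], columns=df2.columns)  # still 7 2 3 8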

Check if any row has the same values as a numpy array

I am working with a pandas.DataFrame that looks as follows:
A B C D
index
1 0 0 0 1
2 1 0 0 1
3 ...
4 ...
...
And I am creating numpy arrays that have the same shape as a row within this dataframe. I want to check if the array I am creating is present within the dataframe.
In this case, for example, my array would look like this, if it is in the dataframe:
a = [0, 0, 0, 1]
It is not if it looks like this:
b = [1, 1, 1, 1]
Any help, even if it is a link to the right answer, is much appreciated; I have looked through Stack Overflow and hopefully did not miss anything.
df = pd.DataFrame({'A': [0, 1, 0, 0],
                   'B': [0, 0, 1, 1],
                   'C': [0, 0, 0, 0],
                   'D': [1, 1, 0, 1]})
# A B C D
# 0 0 0 0 1
# 1 1 0 0 1
# 2 0 1 0 0
# 3 0 1 0 1
>>> a = [0, 0, 0, 1]
>>> (df == a).all(axis=1).any()
True
>>> b = [1, 1, 1, 1]
>>> (df == b).all(axis=1).any()
False
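The same comparison works if a is a numpy array, since df == a broadcasts the array against every row. To also recover which rows match, a small sketch:
import numpy as np

a = np.array([0, 0, 0, 1])
mask = (df == a).all(axis=1)    # boolean Series, True where a row equals a
mask.any()                      # membership test
df.index[mask].tolist()         # index labels of the matching rows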

How can I merge data-frame rows by different columns

I have a DataFrame with 200k rows and some 50 columns with the same id in different columns, looking like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'pic':   [1, 0, 0, 0, 2, 0, 3, 0, 0],
                   'story': [0, 1, 0, 2, 0, 0, 0, 0, 3],
                   'des':   [0, 0, 1, 0, 0, 2, 0, 3, 0],
                   'some_another_value': [2, np.nan, np.nan, np.nan, 4, np.nan, 1, np.nan, np.nan],
                   'some_value': [np.nan, 2, 3, 4, np.nan, 6, np.nan, 8, 9]})
pic story des some_another_value some_value
0 1 0 0 2 nan
1 0 1 0 nan 2
2 0 0 1 nan 3
3 0 2 0 nan 4
4 2 0 0 4 nan
5 0 0 2 nan 6
6 3 0 0 1 nan
7 0 0 3 nan 8
8 0 3 0 nan 9
I would like to merge the rows which have the same value in 'pic', 'story', and 'des':
pic story des some_another_value some_value
0 1 1 1 2 5
3 2 2 2 4 10
6 3 3 3 1 17
How can this be achieved?
*I am looking for a solution that does not contain a for loop.
*I'd prefer not to use a sum method.
I'm not sure why you say "Prefer not a sum method" when your expected output clearly indicates a sum. For your sample data, in each row exactly one of pic, story, des is non-zero, so:
df.groupby(df[['pic', 'story', 'des']].sum(axis=1)).sum()
gives
pic story des some_another_value some_value
1 1 1 1 2.0 5.0
2 2 2 2 4.0 10.0
3 3 3 3 1.0 17.0
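The row-wise sum works as a grouping key because the two zero columns contribute nothing, so the key equals the shared id in every row; and groupby(...).sum() skips NaN by default, which is why the missing values do not poison the totals. An equivalent spelling with the key made explicit (max works too, since the other two entries are zero):
key = df[['pic', 'story', 'des']].max(axis=1)  # the shared id in each row
df.groupby(key).sum()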

vectorize groupby pandas

I have a dataframe like this:
day time category count
1 1 a 13
1 2 a 47
1 3 a 1
1 5 a 2
1 6 a 4
2 7 a 14
2 2 a 10
2 1 a 9
2 4 a 2
2 6 a 1
I want to group by day and category and get a vector of the counts per time, where time can be between 1 and 10. I have the max and min of time stored in two variables called max and min.
This is how I want the resulting dataframe to look:
day category count
1 a [13,47,1,0,2,4,0,0,0,0]
2 a [9,10,0,2,0,1,14,0,0,0]
Does anyone know how to turn this aggregation into a vector?
Use reindex with MultiIndex.from_product to add the missing time combinations, then groupby with list:
df = df.set_index(['day','time', 'category'])
a = df.index.levels[0]
b = range(1,11)
c = df.index.levels[2]
df = df.reindex(pd.MultiIndex.from_product([a,b,c], names=df.index.names), fill_value=0)
df = df.groupby(['day','category'])['count'].apply(list).reset_index()
print (df)
day category count
0 1 a [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 a [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
EDIT:
df = (df.set_index(['day','time', 'category'])['count']
        .unstack(1, fill_value=0)
        .reindex(columns=range(1,11), fill_value=0))
print (df)
time 1 2 3 4 5 6 7 8 9 10
day category
1 a 13 47 1 0 2 4 0 0 0 0
2 a 9 10 0 2 0 1 14 0 0 0
df = df.apply(list, 1).reset_index(name='count')
print (df)
day ... count
0 1 ... [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 ... [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
[2 rows x 3 columns]
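For reference, a self-contained version of the unstack route; lo and hi here are hypothetical stand-ins for the asker's min/max variables (shadowing the built-in names min and max is best avoided):
import pandas as pd

df = pd.DataFrame({'day':      [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'time':     [1, 2, 3, 5, 6, 7, 2, 1, 4, 6],
                   'category': ['a'] * 10,
                   'count':    [13, 47, 1, 2, 4, 14, 10, 9, 2, 1]})

lo, hi = 1, 10  # time bounds
out = (df.set_index(['day', 'time', 'category'])['count']
         .unstack('time', fill_value=0)                     # times become columns
         .reindex(columns=range(lo, hi + 1), fill_value=0)  # fill missing times with 0
         .apply(list, axis=1)                               # collapse each row to a list
         .reset_index(name='count'))
print(out)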
