pandas misreads lines in file [duplicate] - python-3.x

I'm trying to read the following file with pandas using python 3.6:
$ cat tmp2.txt
somename nan 0 0 1 0 0 1 11 0.909091 0 0 1 0 0 7 1 1 0 0 0 0 2
somename nan 0 0 1 0 0 1 36 0.972222 0 0 7 0 5 22 0 6 1 0 0 0 2
somename UgzVrvH-ahjgfT9-NfN4AaABAg.8e3_FgQnopN8e4FLHwai7v0 0 1 0 0 0 25 0.920000 0 0 0 0 2 22 0 1 0 0 0 0 0
somename UgxyXxibolL_qOhMsyZ4AaABAg.8eApKy29u5J8eAxINbTH2m0 0 1 0 0 0 13 1.000000 0 0 0 0 1 10 0 2 0 0 0 0 0
somename nan 0 0 0 0 0 2 56 0.839286 0 0 0 0 11 14 5 7 3 0 3 1 10
When I try reading it with pandas:
>>> import pandas as pd
>>> df = pd.read_csv(header=None, filepath_or_buffer="tmp2.txt", delim_whitespace=True, index_col=0)
>>> df.values[2,:]
array(['UgzVrvH-ahjgfT9-NfN4AaABAg.8e3_FgQnopN8e4FLHwai7v0', 0, 1, 0, 0,
0, 25, 0.92, 0.0, 0, 0, 0, 2, 22, 0, 1, 0, 0, 0, 0, 0, nan],
>>> df.values[3,:]
array(['UgxyXxibolL_qOhMsyZ4AaABAg.8eApKy29u5J8eAxINbTH2m0', 0, 1, 0, 0,
0, 13, 1.0, 0.0, 0, 0, 0, 1, 10, 0, 2, 0, 0, 0, 0, 0, nan],
>>> df.values[4,:]
array([nan, 0, 0, 0, 0, 0, 2, 56.0, 0.8392860000000001, 0, 0, 0, 0, 11,
14, 5, 7, 3, 0, 3, 1, 10.0], dtype=object)
As can be seen, df.values[2,:] and df.values[3,:] have an extraneous nan at the end. It seems like this might be caused by a maximum number of characters per line, but the documentation for pandas.read_csv does not mention any such limit.
QUESTION: What causes this and how can I get pandas.read_csv to read this file correctly?

It's similar to this: python pandas - trailing delimiter confuses read_csv
Your input data has trailing delimiters on some or all of the lines. Two easy fixes are to set usecols in read_csv(), or after reading do something like this:
if df[df.columns[-1]].isnull().all():
    df.drop(df.columns[-1], axis=1, inplace=True)
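The two fixes above can be sketched on a small example. The `raw` string is hypothetical, comma-separated data with a trailing delimiter on every line; it is not the file from the question. The last part counts whitespace-separated fields in two rows copied from the question. Rows with fewer fields than the first row are also padded with NaN by read_csv, so it may be worth checking the counts before picking a fix.

```python
import io

import pandas as pd

# Hypothetical sample: every line ends with a trailing comma,
# so the parser sees an extra, empty last field.
raw = "a,1,2,\nb,3,4,\n"

# Fix 1: read everything, then drop the last column if it is entirely NaN.
df = pd.read_csv(io.StringIO(raw), header=None)
if df[df.columns[-1]].isnull().all():
    df = df.drop(columns=df.columns[-1])

# Fix 2: tell read_csv which columns to keep.
df_usecols = pd.read_csv(io.StringIO(raw), header=None, usecols=[0, 1, 2])

# Diagnostic: count whitespace-separated fields per line.
# These two lines are copied from the question's tmp2.txt.
lines = [
    "somename nan 0 0 1 0 0 1 11 0.909091 0 0 1 0 0 7 1 1 0 0 0 0 2",
    "somename UgzVrvH-ahjgfT9-NfN4AaABAg.8e3_FgQnopN8e4FLHwai7v0 0 1 0 0 0 25 "
    "0.920000 0 0 0 0 2 22 0 1 0 0 0 0 0",
]
field_counts = [len(line.split()) for line in lines]
print(field_counts)
```

If the counts differ between lines, the NaN comes from padding short rows rather than from a trailing delimiter, and dropping the last column would discard real data from the longer rows.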


Using Pandas to assign specific values

I have the following dataframe:
data = {'id': [1, 2, 3, 4, 5, 6, 7, 8],
'stat': ['ordered', 'unconfirmed', 'ordered', 'unknwon', 'ordered', 'unconfirmed', 'ordered', 'back'],
'date': ['2021', '2022', '2023', '2024', '2025','2026','2027', '1990']}
df = pd.DataFrame(data)
I am trying to get the following data frame:
Unfortunately I have not been successful so far. I used the following for loops, handling only stat=='ordered':
y0 = np.zeros((len(df), 8), dtype=int)
y1 = [1990]
if stat=='ordered':
for i in df['id']:
for j in y1:
if df.loc[i].at['date'] in y1:
y0[i][y1.index(j)] = 1
y0[i][y1.index(j)] = 0
Unfortunately, this did not return the expected result, and it also takes a very long time to run. I tried to use groupby, since it should be faster than for loops, but I could not figure out how to use it properly. Any ideas would be very much appreciated.
[1, 2, 1, 3, 1, 2, 1, 0], axis=0
   id         stat  date  1990  2021  2022  2023  2024  2025  2026  2027
0   1      ordered  2021     0     1     1     1     1     1     1     1
1   2  unconfirmed  2022     0     0     2     2     2     2     2     2
2   3      ordered  2023     0     0     0     1     1     1     1     1
3   4      unknwon  2024     0     0     0     0     3     3     3     3
4   5      ordered  2025     0     0     0     0     0     1     1     1
5   6  unconfirmed  2026     0     0     0     0     0     0     2     2
6   7      ordered  2027     0     0     0     0     0     0     0     1
7   8         back  1990     0     0     0     0     0     0     0     0
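One loop-free way to build a table like the one above is NumPy broadcasting. This is only a sketch: the mapping from stat to a numeric code (ordered=1, unconfirmed=2, unknwon=3, back=0) is an assumption inferred from the table and from the fragment `[1, 2, 1, 3, 1, 2, 1, 0]`, and the names `stat_codes`, `filled` and `out` are made up for illustration.

```python
import numpy as np
import pandas as pd

data = {'id': [1, 2, 3, 4, 5, 6, 7, 8],
        'stat': ['ordered', 'unconfirmed', 'ordered', 'unknwon',
                 'ordered', 'unconfirmed', 'ordered', 'back'],
        'date': ['2021', '2022', '2023', '2024', '2025', '2026', '2027', '1990']}
df = pd.DataFrame(data)

# Assumed mapping from status to the value written into the year columns.
stat_codes = {'ordered': 1, 'unconfirmed': 2, 'unknwon': 3, 'back': 0}
codes = df['stat'].map(stat_codes).to_numpy()

row_years = df['date'].astype(int).to_numpy()   # one year per row
years = np.sort(np.unique(row_years))           # one column per distinct year

# mask[i, j] is True when column year j is at or after row i's own year.
mask = years[None, :] >= row_years[:, None]

# Put each row's code wherever the mask is True, 0 elsewhere.
filled = mask * codes[:, None]

out = df.join(pd.DataFrame(filled, columns=years, index=df.index))
print(out)
```

The comparison and the multiplication each run once over the whole array, so there is no Python-level loop over rows.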

Diagonal Dataframe to 1 row

I need to convert a diagonal Dataframe to 1 row Dataframe.
df = pd.DataFrame([[7, 0, 0, 0],
[0, 2, 0, 0],
[0, 0, 3, 0],
[0, 0, 0, 8]])
0 7 0 0 0
1 0 2 0 0
2 0 0 3 0
3 0 0 0 8
Expected output:
0 7 2 3 8
What I have tried so far:
df1 = df.sum().to_frame().transpose()
0 7 2 3 8
It does the job, but is there a more elegant way to do this with groupby or another pandas built-in?
I'm not sure there is a more 'elegant' way; I can only propose alternatives:
Use numpy.diagonal
pd.DataFrame([df.to_numpy().diagonal()], columns=df.columns)
0 7 2 3 8
Use groupby with boolean (not sure if this is better than your solution):
df.groupby([True] * len(df), as_index=False).sum()
0 7 2 3 8
You can also use np.diagonal(df):
pd.DataFrame(np.diagonal(df), df.columns).T
0 7 2 3 8
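The sum-based and diagonal-based approaches agree only while every off-diagonal entry is zero. The sketch below shows where they diverge. The frame uses default integer column labels, which is an assumption because the original column names are not shown, and `noisy` is a hypothetical variant added for illustration.

```python
import pandas as pd

df = pd.DataFrame([[7, 0, 0, 0],
                   [0, 2, 0, 0],
                   [0, 0, 3, 0],
                   [0, 0, 0, 8]])

# Column sums: equal to the diagonal only because everything else is 0.
by_sum = df.sum().to_frame().transpose()

# Explicit diagonal: independent of the off-diagonal values.
by_diag = pd.DataFrame([df.to_numpy().diagonal()], columns=df.columns)

# Hypothetical frame with one non-zero off-diagonal entry.
noisy = df.copy()
noisy.iloc[0, 1] = 5
noisy_sum = noisy.sum().to_frame().transpose()
noisy_diag = pd.DataFrame([noisy.to_numpy().diagonal()], columns=noisy.columns)

print(by_sum.iloc[0].tolist(), by_diag.iloc[0].tolist())
print(noisy_sum.iloc[0].tolist(), noisy_diag.iloc[0].tolist())
```

If the frame is guaranteed to be diagonal, the sum is fine; otherwise the diagonal extraction states the intent more directly.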

Check if any row has the same values as a numpy array

I am working with a pandas.Dataframe that looks as follows:
1 0 0 0 1
2 1 0 0 1
3 ...
4 ...
I am creating numpy arrays that have the same shape as a row of this dataframe, and I want to check whether the array I am creating is present in the dataframe.
In this case, for example, my array would look like this, if it is in the dataframe:
a= [0,0,0,1]
It is not if it looks like this:
b = [1,1,1,1]
Any help, even if it is a link to the right answer, is much appreciated. I have looked through Stack Overflow and hopefully did not miss anything.
df = pd.DataFrame({'A':[0, 1, 0, 0],
'B':[0, 0, 1, 1],
'C':[0, 0, 0, 0],
'D':[1, 1, 0, 1]})
# A B C D
# 0 0 0 0 1
# 1 1 0 0 1
# 2 0 1 0 0
# 3 0 1 0 1
>>> a = [0, 0, 0, 1]
>>> (df == a).all(axis=1).any()
>>> b = [1, 1, 1, 1]
>>> (df == b).all(axis=1).any()
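The check above can be wrapped in a small helper. This is a sketch: the function name `row_exists` and the shape check are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def row_exists(frame, candidate):
    """Return True if any row of `frame` equals `candidate` element-wise."""
    candidate = np.asarray(candidate)
    if candidate.shape != (frame.shape[1],):
        raise ValueError("candidate must have one value per column")
    # Broadcast the candidate against every row, require all columns to
    # match within a row, then ask whether any row matched.
    return bool((frame.to_numpy() == candidate).all(axis=1).any())


df = pd.DataFrame({'A': [0, 1, 0, 0],
                   'B': [0, 0, 1, 1],
                   'C': [0, 0, 0, 0],
                   'D': [1, 1, 0, 1]})

found_a = row_exists(df, np.array([0, 0, 0, 1]))
found_b = row_exists(df, np.array([1, 1, 1, 1]))
print(found_a, found_b)
```

Wrapping the result in `bool()` returns a plain Python boolean rather than a NumPy scalar, which is easier to use in `if` statements and assertions.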

How can I merge data-frame rows by different columns

I have a DataFrame with 200k rows and some 50 columns with same id in different columns, looking like below:
df = pd.DataFrame({'pic': [1, 0, 0, 0, 2, 0, 3, 0, 0]
, 'story': [0, 1, 0, 2, 0, 0, 0, 0, 3]
, 'des': [0, 0, 1, 0, 0, 2, 0, 3, 0]
, 'some_another_value': [2, 1, 6, 5, 4, 3, 1, 1, 1]
, 'some_value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
pic story des some_another_value some_value
0 1 0 0 2 nan
1 0 1 0 nan 2
2 0 0 1 nan 3
3 0 2 0 nan 4
4 2 0 0 4 nan
5 0 0 2 nan 6
6 3 0 0 1 nan
7 0 0 3 nan 8
8 0 3 0 nan 9
I would like to merge the rows which have the same value in 'pic' 'story' 'des'
pic story des some_another_value some_value
0 1 1 1 2 5
3 2 2 2 4 10
6 3 3 3 1 17
How can this be achieved?
*I am looking for a solution that does not contain a for loop
*Preferably not a sum method
I'm not sure why you say 'Prefer not a sum method' when your expected output clearly indicates a sum. For your sample data, in each row exactly one of pic, story, des is nonzero, so the row-wise sum of those three columns can serve as the group key:
df.groupby(df[['pic','story', 'des']].sum(1)).sum()
pic story des some_another_value some_value
1 1 1 1 2.0 5.0
2 2 2 2 4.0 10.0
3 3 3 3 1.0 17.0
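A self-contained version of the groupby above is sketched below. The constructor shown in the question has no missing values, while the printed table does, so this sketch assumes the printed table (with NaN) is the intended input. The names `key`, `merged` and `merged_agg` are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Input rebuilt from the printed table in the question (assumption).
df = pd.DataFrame({
    'pic':   [1, 0, 0, 0, 2, 0, 3, 0, 0],
    'story': [0, 1, 0, 2, 0, 0, 0, 0, 3],
    'des':   [0, 0, 1, 0, 0, 2, 0, 3, 0],
    'some_another_value': [2, nan, nan, nan, 4, nan, 1, nan, nan],
    'some_value':         [nan, 2, 3, 4, nan, 6, nan, 8, 9],
})

# Exactly one of the three id columns is nonzero per row,
# so their row-wise sum recovers the shared id.
key = df[['pic', 'story', 'des']].sum(axis=1)

# Sum within each id; NaN values are skipped by default.
merged = df.groupby(key).sum()

# Variant with per-column aggregations, in case only the value
# columns should be summed.
merged_agg = df.groupby(key).agg({
    'pic': 'max', 'story': 'max', 'des': 'max',
    'some_another_value': 'sum', 'some_value': 'sum',
})
print(merged)
```

Both variants are loop-free. The second one makes explicit that the id columns are only carried through, while the value columns are summed.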

vectorize groupby pandas

I have a dataframe like this:
day time category count
1 1 a 13
1 2 a 47
1 3 a 1
1 5 a 2
1 6 a 4
2 7 a 14
2 2 a 10
2 1 a 9
2 4 a 2
2 6 a 1
I want to group by day, and category and get a vector of the counts per time. Where time can be between 1 and 10. The max and min of time I have defined in two variables called max and min.
This is how I want the resulting dataframe to look:
day category count
1 a [13,47,1,0,2,4,0,0,0,0]
2 a [9,10,0,2,0,1,14,0,0,0]
Does anyone know how to make this aggregation into a vector?
Use reindex with MultiIndex.from_product to append the missing combinations, and then groupby with list:
df = df.set_index(['day','time', 'category'])
a = df.index.levels[0]
b = range(1,11)
c = df.index.levels[2]
df = df.reindex(pd.MultiIndex.from_product([a,b,c], names=df.index.names), fill_value=0)
df = df.groupby(['day','category'])['count'].apply(list).reset_index()
print (df)
day category count
0 1 a [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 a [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
Another solution, starting again from the original df, is to unstack the time level and reindex the columns:
df = (df.set_index(['day','time', 'category'])['count']
.unstack(1, fill_value=0)
.reindex(columns=range(1,11), fill_value=0))
print (df)
time 1 2 3 4 5 6 7 8 9 10
day category
1 a 13 47 1 0 2 4 0 0 0 0
2 a 9 10 0 2 0 1 14 0 0 0
df = df.apply(list, 1).reset_index(name='count')
print (df)
day ... count
0 1 ... [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 ... [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
[2 rows x 3 columns]
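The unstack approach can be put together as one runnable sketch. The question mentions variables called max and min; they are renamed `time_min` and `time_max` here to avoid shadowing the Python built-ins, and `wide` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'day':      [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'time':     [1, 2, 3, 5, 6, 7, 2, 1, 4, 6],
    'category': ['a'] * 10,
    'count':    [13, 47, 1, 2, 4, 14, 10, 9, 2, 1],
})

# Assumed bounds for the time axis, as described in the question.
time_min, time_max = 1, 10

# One row per (day, category), one column per time slot, 0 where missing.
wide = (df.set_index(['day', 'category', 'time'])['count']
          .unstack('time', fill_value=0)
          .reindex(columns=range(time_min, time_max + 1), fill_value=0))

# Collapse each row of the wide table into a single list.
result = (pd.Series(wide.to_numpy().tolist(), index=wide.index, name='count')
            .reset_index())
print(result)
```

Keeping `wide` around is often more convenient than the list column, because the per-slot counts stay available for further vectorized operations.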
