Most frequently occurring numbers across multiple columns using pandas - python-3.x

I have a data frame with numbers in multiple columns, listed by date. What I'm trying to do is find the most frequently occurring numbers across the whole data set, and also grouped by date.
import pandas as pd
import glob

def lotnorm(pdobject):
    # Clean up special characters in the column names and make
    # the Date column the index, as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # List the CSV files in the data directory.
    for lotpath in glob.glob("data/*.csv"):
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret

print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is:
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe. This worked for individual lines of code, but gave a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)

lotcomb = pd.DataFrame()
for i in lotimport()['ozlotto'].columns.tolist():
    print(f"{i} - {type(i)}")
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
    print(lotcomb)
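For what it's worth, the odd loop result is expected: appending a Series to a DataFrame adds it as a new row, not as extra values. A minimal sketch of flattening the columns into one Series with pd.concat instead, assuming the same lotimport data:

import pandas as pd

df = lotimport()['ozlotto']

# Concatenate every column end-to-end into a single flat Series.
lotcomb = pd.concat([df[col] for col in df.columns], ignore_index=True)
print(lotcomb)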

This solution might be the one you are looking for.
import numpy as np

# Count how often each value appears anywhere in the frame.
freqvalues = np.unique(df.to_numpy(), return_counts=True)
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns

df.max(axis=0) returns the maximum of each column, while df.max(axis=1) returns the maximum of each row (along the index).
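To illustrate the axis semantics on a toy frame (a minimal sketch; the column names are made up):

import pandas as pd

df = pd.DataFrame({'a': [1, 5], 'b': [7, 2]})
print(df.max(axis=0))  # column maxima: a -> 5, b -> 7
print(df.max(axis=1))  # row maxima: 7, 5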

OK, so the final answer I came up with was a mix of a few things, including some of the great input from people in this thread. Essentially I do the following:
1. Pull in the CSV file, clean up the dates and the column names, and convert it to a pandas dataframe.
2. Create a new pandas series and append each column to it, ignoring dates to prevent conflicts.
3. Use Vioxini's suggestion of numpy to get counts of unique values, turn the values into the index, sort the column by count in descending order, and return the top 10 values.
Below is the resulting code; I hope it helps someone else.
import pandas as pd
import glob
import numpy as np

def lotnorm(pdobject):
    # Clean up special characters in the column names and make
    # the Date column the index, as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # List the CSV files in the data directory.
    for lotpath in glob.glob("data/*.csv"):
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret

lotcomb = pd.Series([], dtype=object)
for i in lotimport()['ozlotto'].columns.tolist():
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)

freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'], ascending=False).head(10)
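As a side note, Series.append was deprecated in pandas 1.4 and removed in 2.0, and the append loop can be avoided entirely. A minimal sketch of an equivalent pipeline using stack and value_counts, assuming the same lotimport data layout:

df = lotimport()['ozlotto']

# Flatten every column into one Series, count each number,
# and keep the ten most frequent (value_counts sorts descending).
lotop = (
    df.stack()
      .value_counts()
      .head(10)
      .rename_axis("Numbers")
      .to_frame("Frequency")
)
print(lotop)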

Related

Creating a list from pandas Series

I'm trying to create a list from 3 different series, with elements of the shape "({A} {B} {C})", where A is the 1st element from series 1, B the 1st element from series 2, and C the 1st element from series 3; this way it should create a list containing 600 elements.
List 1 List 2 List 3
u_p0 1 v_p0 2 w_p0 7
u_p1 21 v_p1 11 w_p1 45
u_p2 32 v_p2 25 w_p2 32
u_p3 45 v_p3 76 w_p3 49
... .... ....
u_p599 56 v_p599 78 w_599 98
Now I want the output list as follows
(1 2 7)
(21 11 45)
(32 25 32)
(45 76 49)
.....
These are the 3 series I created from a dataframe
r1=turb_1.iloc[qw1] #List1
r2=turb_1.iloc[qw2] #List2
r3=turb_1.iloc[qw3] #List3
For the output, I think Python's formatted string method will be useful, but I'm not quite sure how to proceed.
turb_3= ["({A} {B} {C})".format(A=i,B=j,C=k) for i in r1 for j in r2 for k in r3]
Any kind of help will be useful.
Use pandas.DataFrame.itertuples with str.format:
# Sample data
print(df)
col1 col2 col3
0 1 2 7
1 21 11 45
2 32 25 32
3 45 76 49
fmt = "({} {} {})"
[fmt.format(*tup) for tup in df[["col1", "col2", "col3"]].itertuples(False, None)]
Output:
['(1 2 7)', '(21 11 45)', '(32 25 32)', '(45 76 49)']
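Note that the attempt in the question (for i in r1 for j in r2 for k in r3) iterates the Cartesian product of the three series, producing 600 × 600 × 600 strings instead of 600. The element-wise pairing can also be written with zip, using the question's r1, r2 and r3:

# Pair up the i-th element of each series instead of taking all combinations.
turb_3 = ["({} {} {})".format(a, b, c) for a, b, c in zip(r1, r2, r3)]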

When using min() - ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() [duplicate]

How can I reference the minimum value of two dataframes as part of a pandas dataframe equation? I tried using the Python min() function, which did not work. I'm sorry if this is well documented somewhere, but I have not been able to find a working solution for this problem. I am looking for something along the lines of this:
data['eff'] = pd.DataFrame([data['flow_h'], data['flow_c']]).min() * Cp * (data[' Thi'] - data[' Tci'])
I also tried the pandas min() function, which is not working either.
min_flow = pd.DataFrame([data['flow_h'], data['flow_c']]).min()
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I was confused by this error. The data columns are just numbers and a name; I wasn't sure where the index came into play.
import pandas as pd
import numpy as np

np.random.seed(365)
rows = 10
flow = {'flow_c': [np.random.randint(100) for _ in range(rows)],
        'flow_d': [np.random.randint(100) for _ in range(rows)],
        'flow_h': [np.random.randint(100) for _ in range(rows)]}
data = pd.DataFrame(flow)
# display(data)
flow_c flow_d flow_h
0 82 36 43
1 52 48 12
2 33 28 77
3 91 99 11
4 44 95 27
5 5 94 64
6 98 3 88
7 73 39 92
8 26 39 62
9 56 74 50
If you are trying to get the row-wise minimum of two or more columns, use pandas.DataFrame.min. Note that axis=0 is the default, so specifying axis=1 is necessary.
data['min_c_h'] = data[['flow_h','flow_c']].min(axis=1)
# display(data)
flow_c flow_d flow_h min_c_h
0 82 36 43 43
1 52 48 12 12
2 33 28 77 33
3 91 99 11 11
4 44 95 27 27
5 5 94 64 5
6 98 3 88 88
7 73 39 92 73
8 26 39 62 26
9 56 74 50 50
If you'd like to get a single minimum value across multiple columns:
data[['flow_h','flow_c']].min().min()
The first min() calculates the minimum per column and returns a pandas Series; the second min() returns the minimum of those per-column minimums.

Python 3 script uses too much memory

As homework for my IT lessons I need to write a script that finds the highest power of 4 contained in a modified input number, but I may use only 8 MB of RAM. I used a logarithmic function for this, so my code looks like this:
from math import log, floor

n = int(input())
numbers = []
for i in range(0, n):
    numbers.append(floor(int(input()) / 10))
for i in numbers:
    print(4 ** floor(log(i, 4)))
But I checked this script on my PC and it uses more than 8MB!
Partition of a set of 74690 objects. Total size = 8423721 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 23305 31 2100404 25 2100404 25 str
1 19322 26 1450248 17 3550652 42 tuple
2 5017 7 724648 9 4275300 51 types.CodeType
3 9953 13 716915 9 4992215 59 bytes
4 742 1 632536 8 5624751 67 type
5 4618 6 628048 7 6252799 74 function
6 742 1 405720 5 6658519 79 dict of type
7 187 0 323112 4 6981631 83 dict of module
8 612 1 278720 3 7260351 86 dict (no owner)
9 63 0 107296 1 7367647 87 set
<197 more rows. Type e.g. '_.more' to view.>
On my phone, however, this script uses only 2.5MB:
Partition of a set of 35586 objects. Total size = 2435735 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 9831 28 649462 27 649462 27 str
1 9014 25 365572 15 1015034 42 tuple
2 4669 13 261232 11 1276266 52 bytes
3 2357 7 198684 8 1474950 61 types.CodeType
4 436 1 166276 7 1641226 67 type
5 2156 6 155232 6 1796458 74 function
6 436 1 130836 5 1927294 79 dict of type
7 93 0 87384 4 2014678 83 dict of module
8 237 1 62280 3 2076958 85 dict (no owner)
9 1091 3 48004 2 2124962 87 types.WrapperDescriptorType
<115 more rows. Type e.g. '_.more' to view.>
I tried changing list to tuple, but it didn't make any difference.
Is there any possibility to decrease/limit RAM usage?
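Most of the measured footprint here is interpreter overhead (imported modules, code objects) rather than the list itself, but the part the script controls can be trimmed by processing each number as it is read instead of storing them all, and by using integer arithmetic in place of the float log (which also avoids rounding errors on large inputs). A minimal sketch:

n = int(input())
for _ in range(n):
    i = int(input()) // 10
    # Largest power of 4 that fits into i, using integers only.
    p = 1
    while p * 4 <= i:
        p *= 4
    print(p)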

How can I add an X axis showing plot data seconds to a matplotlib pyplot price volume graph?

The code below plots a price volume chart using data from a tab separated csv file. Each row contains values for these columns: IDX, TRD, TIMESTAMPMS, VOLUME and PRICE. As is, the X axis shows the IDX value. I would like the X axis to display the seconds computed from the timestamp in milliseconds attached to each row. How can this be obtained?
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import pandas as pd
data = pd.read_csv('secondary-2018-08-12-21-32-56.csv', index_col=0, sep='\t')
print(data.head(50))
fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(10,5))
ax[0].plot(data.index, data['PRICE'])
ax[1].bar(data.index, data['VOLUME'])
plt.show()
The drawn graph looks like this:
Here are the data as displayed by the print(data.head(50)) instruction:
TRD TIMESTAMPMS VOLUME PRICE
IDX
1 4 1534102380000 0.363583 6330.41
2 20 1534102381000 5.509219 6329.13
3 3 1534102382000 0.199049 6328.69
4 5 1534102383000 1.055055 6327.36
5 2 1534102384000 0.006343 6328.26
6 4 1534102385000 0.167502 6330.38
7 1 1534102386000 0.002039 6326.69
8 0 1534102387000 0.000000 6326.69
9 4 1534102388000 0.163813 6327.62
10 2 1534102389000 0.007060 6326.66
11 4 1534102390000 0.015489 6327.64
12 5 1534102391000 0.035618 6328.35
13 2 1534102392000 0.006003 6330.12
14 5 1534102393000 0.172913 6328.77
15 1 1534102394000 0.019972 6328.03
16 3 1534102395000 0.007429 6328.03
17 1 1534102396000 0.000181 6328.03
18 3 1534102397000 1.041483 6328.03
19 2 1534102398000 0.992897 6328.74
20 3 1534102399000 0.061871 6328.11
21 2 1534102400000 0.000123 6328.77
22 4 1534102401000 0.028650 6330.25
23 2 1534102402000 0.035504 6330.01
24 3 1534102403000 0.982527 6330.11
25 5 1534102404000 0.298366 6329.11
26 2 1534102405000 0.071119 6330.06
27 3 1534102406000 0.025547 6330.02
28 2 1534102407000 0.003413 6330.11
29 4 1534102408000 0.431217 6330.05
30 3 1534102409000 0.021627 6330.23
31 1 1534102410000 0.009661 6330.28
32 1 1534102411000 0.004209 6330.27
33 1 1534102412000 0.000603 6328.07
34 6 1534102413000 0.655872 6330.31
35 1 1534102414000 0.000452 6328.09
36 7 1534102415000 0.277340 6328.07
37 8 1534102416000 0.768351 6328.04
38 1 1534102417000 0.078893 6328.20
39 2 1534102418000 0.000446 6326.24
40 2 1534102419000 0.317381 6326.83
41 2 1534102420000 0.100009 6326.24
42 2 1534102421000 0.000298 6326.25
43 6 1534102422000 0.566820 6330.00
44 1 1534102423000 0.000060 6326.30
45 2 1534102424000 0.047524 6326.30
46 4 1534102425000 0.748773 6326.61
47 3 1534102426000 0.007656 6330.23
48 1 1534102427000 0.000019 6326.32
49 1 1534102428000 0.000014 6326.34
50 0 1534102429000 0.000000 6326.34
I believe you need data.set_index('TIMESTAMPMS') to get the axis to autoscale.
I don't know if I understood you correctly; try with:
data['TIMESTAMPMS'] = data['TIMESTAMPMS']/1000
ax[0].plot(data['TIMESTAMPMS'], data['PRICE'])
ax[1].bar(data['TIMESTAMPMS'], data['VOLUME'])
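Putting that together, converting the millisecond timestamps to seconds elapsed since the first row gives an X axis that starts at zero; a minimal sketch, assuming the same CSV layout as in the question:

import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv('secondary-2018-08-12-21-32-56.csv', index_col=0, sep='\t')

# Seconds elapsed since the first sample.
seconds = (data['TIMESTAMPMS'] - data['TIMESTAMPMS'].iloc[0]) / 1000

fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(10, 5))
ax[0].plot(seconds, data['PRICE'])
ax[1].bar(seconds, data['VOLUME'])
ax[1].set_xlabel('Seconds')
plt.show()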

Plot individuals' home ranges with Adehabitat

I am trying to put the names of the individuals from my research into a home-range polygon plot, but after many attempts I still cannot achieve it.
Here is an example of my data: X and Y are coordinates, and id identifies individuals.
X Y id
29 29 4
44 28 7
57 57 5
60 81 11
32 41 4
43 29 7
57 57 5
46 83 11
32 41 4
43 29 7
57 56 5
60 82 11
35 40 4
43 28 7
62 55 5
54 73 11
27 40 4
43 28 7
61 54 5
First, I calculated the home range of my data with MCP:
cp <- mcp((data)[,1],percent=95, unin = c("m"), unout = c("m2"))
And then plotted it:
plot(cp, axes=TRUE, border = rainbow(12))
But I don't know which polygon corresponds to each individual, and if possible I need to include the id of my individuals inside each polygon.
Any help would be appreciated!!
Thanks
Juan
Here is an example using the example data from the adehabitatHR package, since you are not really providing a reproducible example.
library(adehabitatHR)
data("puechabonsp")
cp <- mcp(puechabonsp$relocs[, 1], percent=95, unin = c("m"), unout = c("m2"))
One way would be to use ggplot2 and sf:
library(sf)
library(tidyverse)

st_as_sf(cp) %>%
  ggplot(., aes(fill = id)) +
  geom_sf(alpha = 0.5) +
  scale_fill_discrete(name = "Animal id")
