How to create an "inkblot" chart with R?
How can I create a chart like
http://junkcharts.typepad.com/junk_charts/2010/01/leaving-ink-traces.html
where several time series (one per country) are displayed horizontally as symmetric areas?
I think that if I could display one time series this way, it would be easy to generalize to several countries using mfrow.
Sample data:
#Solar energy production in Europe, by country (EC),(1 000 toe)
Country,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007
Belgium,1,1,1,1,1,1,2,2,3,3,3,5
Bulgaria,-,-,-,-,-,-,-,-,-,-,-,-
Czech Republic,0,0,0,0,0,0,0,0,2,2,3,4
Denmark,6,7,7,8,8,8,9,9,9,10,10,11
Germany (including ex-GDR from 1991),57,70,83,78,96,150,184,216,262,353,472,580
Estonia,-,-,-,-,-,-,-,-,-,-,-,-
Ireland,0,0,0,0,0,0,0,0,0,0,1,1
Greece,86,89,93,97,99,100,99,99,101,102,109,160
Spain,26,23,26,29,33,38,43,48,58,65,83,137
France,15,16,17,18,26,19,19,18,19,22,29,37
Italy,8,9,11,11,12,14,16,18,21,30,38,56
Cyprus,32,33,34,35,35,34,35,36,40,41,43,54
Latvia,-,-,-,-,-,-,-,-,-,-,-,-
Lithuania,-,-,-,-,-,-,-,-,-,-,-,-
Luxembourg (Grand-Duché),0,0,0,0,0,0,0,0,1,2,2,2
Hungary,0,0,0,0,0,1,2,2,2,2,2,3
Netherlands,6,7,8,10,12,14,16,19,20,22,22,23
Austria,42,48,55,58,64,69,74,80,86,92,101,108
Poland,0,0,0,0,0,0,0,0,0,0,0,0
Portugal,16,16,17,18,18,19,20,21,21,23,24,28
Romania,0,0,0,0,0,0,0,0,0,0,0,0
Slovenia,-,-,-,-,-,-,-,-,-,-,-,-
Slovakia,0,0,0,0,0,0,0,0,0,0,0,0
Finland,0,0,0,0,1,1,1,1,1,1,1,1
Sweden,4,4,5,5,5,6,4,5,5,6,6,9
United Kingdom,6,6,7,7,11,13,16,20,25,30,37,46
Croatia,0,0,0,0,0,0,0,0,0,0,0,1
Turkey,159,179,210,236,262,287,318,350,375,385,402,420
Iceland,-,-,-,-,-,-,-,-,-,-,-,-
Norway,0,0,0,0,0,0,0,0,0,0,0,0
Switzerland,18,19,21,23,24,26,23,24,25,26,28,30
#-='Not applicable' or 'Real zero' or 'Zero by default', :='Not available'
#Source of Data:,Eurostat, http://spreadsheets.google.com/ccc?key=0Agol553XfuDZdFpCQU1CUVdPZ3M0djJBSE1za1NGV0E&hl=en_GB
#Last Update:,30.04.2009
#Date of extraction:,17 Aug 2009 07:41:12 GMT, http://epp.eurostat.ec.europa.eu/tgm/table.do?tab=table&init=1&plugin=1&language=en&pcode=ten00082
You can use polygon() in base graphics; for instance:
x <- seq(as.POSIXct("1949-01-01", tz="GMT"), length=36, by="months")
y <- rnorm(length(x))
plot(x, y, type="n", ylim=c(-1,1)*max(abs(y)))
polygon(c(x, rev(x)), c(y, -rev(y)), col="cornflowerblue", border=NA)
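To generalize this to several series with mfrow, as the question suggests, here is a minimal sketch of mine (random data standing in for the real series, reusing x from the snippet above):
set.seed(1)
countries <- c("A", "B", "C")
op <- par(mfrow=c(length(countries), 1), mar=c(2, 4, 1, 1))
for (cty in countries) {
  yc <- abs(rnorm(length(x)))                    # stand-in for one country's series
  plot(x, yc, type="n", ylim=c(-1,1)*max(yc), ylab=cty, yaxt="n", bty="n")
  polygon(c(x, rev(x)), c(yc, -rev(yc)), col="cornflowerblue", border=NA)
}
par(op)                                          # restore graphics parameters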
Update: Using panel.polygon from lattice:
library("lattice")
library("RColorBrewer")
df <- data.frame(x=rep(x, 3),
                 y=rnorm(3*length(x)),
                 variable=gl(3, length(x)))
p <- xyplot(y~x|variable, data=df,
            ylim=c(-1,1)*max(abs(y)),
            layout=c(1,3),
            fill=brewer.pal(3, "Pastel2"),
            panel=function(...) {
              args <- list(...)
              print(args)  # handy for inspecting what the panel function receives
              panel.polygon(c(args$x, rev(args$x)),
                            c(args$y, -rev(args$y)),
                            fill=args$fill[panel.number()],
                            border=NA)
            })
print(p)
With a little polishing, this ggplot solution will look like what you want:
(Image: faceted ggplot2 ribbon chart of the sample data — http://www.imagechicken.com/uploads/1264790429056858700.png)
Here's how to make it from your data:
require(ggplot2)
require(reshape2)  # provides melt(); older ggplot2 attached it automatically
First, let's import your data and restructure it into the long form ggplot likes:
rdata = read.csv("data.csv",
                 # options: read '-' as NA, skip the first comment line (#Solar...),
                 # strip whitespace at line ends, accept numbers as column headings
                 na.strings="-", skip=1, strip.white=T, check.names=F)
# Convert to long format and check years are numeric
data = melt(rdata)
data = transform(data,year=as.numeric(as.character(variable)))
# geom_ribbon hates NAs.
data = data[!is.na(data$value),]
> summary(data)
Country variable value year
Austria : 12 1996 : 25 Min. : 0.00 Min. :1996
Belgium : 12 1997 : 25 1st Qu.: 0.00 1st Qu.:1999
Croatia : 12 1998 : 25 Median : 7.00 Median :2002
Cyprus : 12 1999 : 25 Mean : 36.73 Mean :2002
Czech Republic: 12 2000 : 25 3rd Qu.: 30.00 3rd Qu.:2004
Denmark : 12 2001 : 25 Max. :580.00 Max. :2007
(Other) :228 (Other):150
Now let's plot it:
ggplot(data=data, aes(fill=Country)) +
  facet_grid(Country~., space="free", scales="free_y") +
  opts(legend.position="none") +
  geom_ribbon(aes(x=year, ymin=-value, ymax=+value))
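A side note of mine, not from the original answer: in current ggplot2 releases opts() is gone and melt() is no longer attached, so a roughly equivalent sketch using tidyr (variable names like data_long are my own) would be:
library(ggplot2)
library(tidyr)
data_long <- pivot_longer(rdata, cols=-Country, names_to="year", values_to="value")
data_long$year <- as.numeric(data_long$year)
data_long <- data_long[!is.na(data_long$value), ]   # geom_ribbon still dislikes NAs
ggplot(data_long, aes(x=year, ymin=-value, ymax=value, fill=Country)) +
  geom_ribbon() +
  facet_grid(Country~., space="free", scales="free_y") +
  theme(legend.position="none")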
Using rcs's first approach, here is a solution for the sample data with base graphics:
rawData <- read.csv("solar.csv", na.strings="-")  # assumes the '#' comment lines were stripped from the file (or add comment.char="#")
data <- ts(t(as.matrix(rawData[,2:13])), names=rawData[,1], start=1996)
inkblot <- function(series, col=NULL, min.height=40, col.value=24, col.category=17, ...) {
  # assumes non-negative values
  # assumes that series is a multivariate time series
  # assumes that series names are set, i.e. colnames(series) != NULL
  x <- as.vector(time(series))
  if (length(col) == 0) {
    col <- rainbow(dim(series)[2])
  }
  # total vertical space: every category gets at least min.height
  ytotal <- 0
  for (category in colnames(series)) {
    y <- series[, category]
    y <- y[!is.na(y)]
    ytotal <- ytotal + max(y, min.height)
  }
  oldpar <- par(no.readonly = TRUE)
  par(mar=c(2,3,0,10)+0.1, cex=0.7)
  plot(x, 1:length(x), type="n", ylim=c(0,1)*ytotal, yaxt="n", xaxt="n", bty="n", ylab="", xlab="", ...)
  axis(side=1, at=x)
  catNumber <- 1
  offset <- 0
  for (category in rev(colnames(series))) {
    y <- 0.5 * as.vector(series[, category])  # half of the value goes above and below the baseline
    offset <- offset + max(max(abs(y[!is.na(y)])), 0.5*min.height)
    polygon(c(x, rev(x)), c(offset+y, offset-rev(y)), col=col[catNumber], border=NA)
    # first and last values (back on the original scale) plus the category name in the margins
    mtext(text=2*y[1], side=2, at=offset, las=2, cex=0.7, col=col.value)
    mtext(text=2*y[length(y)], side=4, line=-1, at=offset, las=2, cex=0.7, col=col.value)
    mtext(text=category, side=4, line=2, at=offset, las=2, cex=0.7, col=col.category)
    offset <- offset + max(max(abs(y[!is.na(y)])), 0.5*min.height)
    catNumber <- catNumber + 1
  }
  par(oldpar)  # restore graphics parameters
}
inkblot(data)
I still need to figure out the vertical grid lines and the transparent coloring.
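As a rough sketch of mine (not tested against the original layout): the vertical grid lines could be drawn with abline() right after the axis() call inside inkblot(), and transparency added by passing a translucent palette built with adjustcolor() (trans_cols is my own name):
# inside inkblot(), right after axis(side=1, at=x):
#   abline(v=x, col="grey85", lty="dotted")                     # vertical grid lines behind the blobs
trans_cols <- adjustcolor(rainbow(ncol(data)), alpha.f=0.5)     # semi-transparent fills
inkblot(data, col=trans_cols)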
Late to this game, but I created a stacked "blot" chart using ggplot2 and another set of data. This uses geom_polygon after the data has been smoothed out.
# data: Masaaki Ishida (luna#pos.to)
# http://luna.pos.to/whale/sta.html
head(blue, 2)
##      Season Norway  U.K. Japan Panama Denmark Germany U.S.A. Netherlands U.S.S.R. South.Africa TOTAL
## [1,]   1931      0  6050     0      0       0       0      0           0        0            0  6050
## [2,]   1932  10128  8496     0      0       0       0      0           0        0            0 18624
hourglass.plot <- function(df) {
  stack.df <- df[,-1]
  stack.df <- stack.df[, sort(colnames(stack.df))]
  stack.df <- apply(stack.df, 1, cumsum)
  stack.df <- apply(stack.df, 1, function(x) sapply(x, cumsum))
  stack.df <- t(apply(stack.df, 1, function(x) x - mean(x)))
  # use this for actual data
  ## coords.df <- data.frame(
  ##   x = rep(c(df[,1], rev(df[,1])), times = dim(stack.df)[2]),
  ##   y = c(apply(stack.df, 1, min),
  ##         as.numeric(apply(stack.df, 2, function(x) c(rev(x), x)))[1:(length(df[,1])*length(colnames(stack.df))*2 - length(df[,1]))]),
  ##   id = rep(colnames(stack.df), each = 2*length(df[,1])))
  ## qplot(x = x, y = y, data = coords.df, geom = "polygon", color = I("white"), fill = id)
  # use this for smoothed data
  density.df <- apply(stack.df, 2, function(x) spline(x = df[,1], y = x))
  id.df <- sort(rep(colnames(stack.df), each = as.numeric(lapply(density.df, function(x) length(x$x)))))
  density.df <- do.call("rbind", lapply(density.df, as.data.frame))
  density.df <- data.frame(density.df, id = id.df)
  smooth.df <- data.frame(
    x = unlist(tapply(density.df$x, density.df$id, function(x) c(x, rev(x)))),
    y = c(apply(unstack(density.df[,2:3]), 1, min),
          unlist(tapply(density.df$y, density.df$id, function(x) c(rev(x), x)))[1:(table(density.df$id)[1] + 2*max(cumsum(table(density.df$id))[-dim(stack.df)[2]]))]),
    id = rep(names(table(density.df$id)), each = 2*table(density.df$id)))
  qplot(x = x, y = y, data = smooth.df, geom = "polygon", color = I("white"), fill = id)
}
hourglass.plot(blue[,-12]) + opts(title = c("Blue Whale Catch"))
(Image: stacked "blot" chart of blue whale catch — http://probabilitynotes.files.wordpress.com/2010/06/bluewhalecatch.png)
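One small caveat from me: opts() has since been removed from ggplot2, so on a current install the title would be added with ggtitle() instead:
hourglass.plot(blue[, -12]) + ggtitle("Blue Whale Catch")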