garchFit uses 4 degrees of freedom by default, specified in the shape parameter, so I'd say if you didn't set it to be estimated with include.shape=TRUE, the degrees of freedom are the default, fixed at 4.

python,r,statistics,statsmodels

This link says it is determined using repeated KPSS tests. I see no reason why it couldn't be implemented in Python, it would just need to be written. Otherwise, you could use rpy2 and just call auto.arima from python. from rpy2 import * import rpy2.robjects as RO RO.r('library(forecast)') # use...

python,table,statistics,beautifulsoup,screen-scraping

I think this is more along the lines of what you are looking for. You can't filter the year like you were trying to do, you have to have an if statement and filter it out yourself. from bs4 import BeautifulSoup from urllib import urlopen url = 'http://www.nfl.com/player/tombrady/2504211/careerstats' html =...

Thanks for the clarification. From what I can tell, you don't have enough information to solve this problem properly. Specifically, you need to have some estimate of the dependence of power from one time step to the next. The longer the time step, the less the dependence; if the steps...

You could do: df <- data.frame(V1 = c("adad131341", "adadar45365", "cavsbsb425", "daadvsv46567567")) library(dplyr) library(stringr) df %>% mutate(V2 = str_extract(V1, "[0-9]+"), V3 = str_extract(V1, "[A-Za-z]+")) Which gives: # V1 V2 V3 #1 adad131341 131341 adad #2 adadar45365 45365 adadar #3 cavsbsb425 425 cavsbsb #4 daadvsv46567567 46567567 daadvsv ...
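
For comparison, the same split can be sketched in Python with the standard re module. This is only an illustrative translation of the R answer above; the helper name split_digits_letters is made up for the example.

```python
import re

def split_digits_letters(value):
    """Return (first run of digits, first run of ASCII letters) found in value."""
    digits = re.search(r"[0-9]+", value)
    letters = re.search(r"[A-Za-z]+", value)
    return (digits.group(0) if digits else None,
            letters.group(0) if letters else None)

values = ["adad131341", "adadar45365", "cavsbsb425", "daadvsv46567567"]
# One row per input: (original, digits, letters), like V1, V2, V3 above
rows = [(v,) + split_digits_letters(v) for v in values]
for row in rows:
    print(row)
```

Note that the character class is written as [A-Za-z]; a range such as A-z would also match a few punctuation characters that sit between the upper and lower case letters in ASCII.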

python,statistics,scipy,probability

(1) "Is it from distribution X" is generally a question which can be answered a priori, if at all; a statistical test for it will only tell you "I have a large sample / not a large sample", which may be true but not too useful. If you are trying...

matlab,statistics,distribution

If I understand correctly, you are asking how to decide which distribution to choose once you have a few fits. There are three major metrics (IMO) for measuring "goodness-of-fit": Chi-Squared, Kolmogorov-Smirnov, and Anderson-Darling. Which to choose depends on a large number of factors; you can randomly pick one or read the...

process,statistics,point,spatstat

For point processes the distinction between marks and covariates is: Marks are values attached to each point (settlement) and often the mark is not meaningful at other locations. Covariates are conceptually meaningful/available throughout the entire survey region (observation window). Mark values can in principle be anything, but basically only two...

You can calculate the p-values by group and then subset in geom_smooth (per the commenters): # Determine p-values of regression p.vals = sapply(unique(d$z), function(i) { coef(summary(lm(y ~ x, data=d[d$z==i, ])))[2,4] }) plt <- ggplot(d) + aes(x=x, y=y, color=z) + geom_point() # Select only values of z for which regression p-value...

linux,bash,awk,statistics,variance

The standard deviation formula is described at http://www.mathsisfun.com/data/standard-deviation.html So basically you need to say: for each item, sum += (item - average)^2, then divide the sum by the number of items. Doing it in your sample input: 5 av=5/1=5 var=(5-5)^2/1=0 5 av=10/2=5 var=((5-5)^2+(5-5)^2)/2=0 5 av=15/3=5 var=3*(5-5)^2/3=0 10 av=25/4=6.25 var=(3*(5-6.25)^2+(10-6.25)^2)/4=4.6875 So in awk we can say: $ awk 'BEGIN {FS=OFS=","}...
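
To check the arithmetic above, here is a small Python sketch of the same running calculation. It is not the awk solution from the answer, just the population-variance formula recomputed after each new value; the function name is an assumption.

```python
def running_mean_and_variance(values):
    """Yield (mean, population variance) of all values seen so far."""
    seen = []
    for value in values:
        seen.append(value)
        mean = sum(seen) / len(seen)
        # Average squared distance from the current mean
        variance = sum((x - mean) ** 2 for x in seen) / len(seen)
        yield mean, variance

results = list(running_mean_and_variance([5, 5, 5, 10]))
for mean, variance in results:
    print(mean, variance)
```

For the sample input 5, 5, 5, 10 the last pair is a mean of 6.25 and a variance of 4.6875, which agrees with the hand calculation. Recomputing from scratch at every step is quadratic, so this is only meant for verifying small inputs.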

The broom package can return confidence intervals for regression model estimates. require(broom) A <- c(12,11,12,15,13,16,13,18,11,14) B <- c(50,51,62,45,63,76,53,68,51,74) model <- lm(A~B) tidy(model, conf.int = TRUE, conf.level = 0.99) term estimate std.error statistic p.value conf.low conf.high 1 (Intercept) 6.8153948 3.75608761 1.814493 0.1071515 -5.78773401 19.418524 2 B 0.1127252 0.06240674 1.806299 0.1085031 -0.09667358...

python,numpy,statistics,scipy,missing-data

You cannot mathematically compute a test statistic based on nan. Unless you find proof/documentation of special treatment of nan, you cannot rely on that. My experience is that in general, even numpy does not treat nan specially, for example for median. Instead the results are whatever they happen...

r,excel,statistics,dataset,google-adwords

Try library(data.table) setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)), Converted=rep(c(1,0), c(Conversions, Impressions-Conversions))) , Keyword] # Keyword Clicked Converted # 1: SampleName 1 1 # 2: SampleName 1 1 # 3: SampleName 1 0 # 4: SampleName 1 0 # 5: SampleName 1 0 # 6: SampleName 0 0 # 7: SampleName 0 0...

javascript,r,plot,statistics,shiny

Answering my own question since, after all, I have found some resources that fit my use case and they seem viable for development. Hopefully it'll come in handy for the community later down the road :) After further investigation, I found the name of "pictogram charts" as an alternative way...

python,matlab,machine-learning,statistics,random-forest

The Problem There are many reasons why the implementation of a random forest in two different programming languages (e.g., MATLAB and Python) will yield different results. First of all, note that results of two random forests trained on the same data will never be identical by design: random forests often...

machine-learning,statistics,classification,multilabel-classification

There are several available metrics, described in the following paper: Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427-437. See Table 3 on page 4 (430) - it contains brief description and formula for 8 metrics; choose the...

algorithm,math,statistics,dynamic-programming

I think you are looking for the correlation coefficient: remember the last N values from both streams (cyclic buffers are ideal for this), compute the correlation coefficient, and detect the similarity drop, for example like this: if (correlation_coefficient<0.997) return "drop below 99.7%"; ...
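
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the original author's code: the class name SimilarityMonitor, the window size, and the choice to treat a constant window as uncorrelated are all assumptions.

```python
from collections import deque
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant window counts as uncorrelated
    return cov / sqrt(var_x * var_y)

class SimilarityMonitor:
    """Keeps the last `window` samples of two streams in cyclic buffers."""

    def __init__(self, window=4, threshold=0.997):
        self.xs = deque(maxlen=window)  # deque with maxlen acts as a ring buffer
        self.ys = deque(maxlen=window)
        self.threshold = threshold

    def push(self, x, y):
        """Add one sample per stream; return True if similarity dropped."""
        self.xs.append(x)
        self.ys.append(y)
        if len(self.xs) < self.xs.maxlen:
            return False  # not enough data yet
        return pearson(self.xs, self.ys) < self.threshold

stream_a = [1, 2, 3, 4, 5, 6, 7, 8]
stream_b = [2, 4, 6, 8, 10, 3, 1, 0]  # follows stream_a, then diverges
monitor = SimilarityMonitor(window=4)
flags = [monitor.push(a, b) for a, b in zip(stream_a, stream_b)]
print(flags)
```

The drop is reported when the coefficient falls below the threshold, so the flags stay False while the two streams move together and turn True once the second stream diverges.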

r,file-io,statistics,contingency

You can use the package ff for this which uses the hard disk drive instead of RAM but it is implemented in a way that it doesn't make it (significantly) slower than the normal way R uses RAM. This is from the package description: The ff package provides data structures...

statistics,wolfram-mathematica,normal-distribution,cdf

1) MultinormalDistribution is now built in, so don't load MultivariateStatistics unless you are running version 7 or older. If you do you'll see MultinormalDistribution colored red indicating a conflict. 2) This works: sig = .5; u = .5; dist = MultinormalDistribution[{0, 0}, sig IdentityMatrix[2]]; delta = CDF[dist, {xx, yy}]...

python,statistics,scipy,normal-distribution,cdf

edit: you actually need to import norm from scipy.stats. I found the answer. You need to use ppf in scipy.stats which stands for "percent point function". So let's say you have a normal distribution with stdDev = 1, and mean = 0 and you want to find the value at which...
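
As a dependency-free illustration of the same idea, the standard library's statistics.NormalDist has an inv_cdf method that plays the role of ppf. This swaps in the standard library for scipy.stats, and the probability 0.95 is an assumed example value, not one taken from the question.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# inv_cdf is the inverse of the CDF, i.e. the "percent point function"
x95 = standard_normal.inv_cdf(0.95)
print(round(x95, 4))

# Round trip: the CDF evaluated at that point gives the probability back
print(round(standard_normal.cdf(x95), 4))
```

With scipy installed, norm.ppf(0.95) from scipy.stats should give the same value for the standard normal.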

Here is a solution based on the excellent foverlaps of the data.table package. library(data.table) ## coerce characters to dates ( numeric) setDT(x)[,c("date1","date2"):=list(as.Date(date1,"%d/%m/%Y"), as.Date(date2,"%d/%m/%Y"))] ## and a dummy date since foverlaps looks for a start,end columns setDT(y)[,c("date1"):=as.Date(date,"%d/%m/%Y")][,date:=date1] ## y must be keyed setkey(y,id,date,date1) foverlaps(x,y,by.x=c("id","date1","date2"))[, list(id,i.date1,date2,date,price)] id i.date1 date2 date price 1: A...

This is a problem known as Longest Increasing Subsequence and there are O(n log n) algorithms for it. To find the percentage, you just have to find LIS(V)/length(V). Here's an example implementation (O(n^2)) in Python EDIT: changed the code to clearly point where an O(n) step can be turned into...
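
For reference, here is a minimal sketch of the O(n log n) variant using binary search from the standard bisect module. It is not the implementation from the answer above, and the function names are chosen only for this example.

```python
from bisect import bisect_left

def lis_length(values):
    """Length of the longest strictly increasing subsequence."""
    # tails[i] is the smallest possible last element of an
    # increasing subsequence of length i + 1 found so far
    tails = []
    for value in values:
        pos = bisect_left(tails, value)  # O(log n) instead of a linear scan
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)

def increasing_fraction(values):
    """LIS(V) / length(V), or 0.0 for an empty input."""
    return lis_length(values) / len(values) if values else 0.0

print(lis_length([10, 9, 2, 5, 3, 7, 101, 18]))
print(increasing_fraction([1, 2, 3, 4]))
```

Using bisect_left makes the subsequence strictly increasing; bisect_right would allow repeated values to count as non-decreasing instead.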

r,statistics,time-series,correlation,xts

What about using rollapply in a different way? As you don't supply the complete dataset, here is a demonstration of what I mean: set.seed(123) m <- matrix(rnorm(100), ncol = 10) rollapply(1:nrow(m), 5, function(x) cor.mean(m[x,])) [1] -0.080029692 -0.038168840 -0.058443824 0.005699772 -0.014459878 -0.021569173 As I just figured out, you can also use the function...

r,statistics,probability,prediction,calibration

The warning is telling you that predict.gam doesn't recognize the value you passed to the type parameter. Since it didn't understand, it decided to use the default value of type, which is "terms". Note that predict.gam with type="terms" returns information about the model terms, not probabilities. Hence the output values...

java,statistics,boxplot,random-sample

Suppose that the min, a, median, b, max values separate the quartiles of the distribution (http://en.wikipedia.org/wiki/Quartile): static public double next(Random rnd, double median, double a, double b, double min, double max) { double d = -3; while (d > 2.698 || d < -2.698) { d = rnd.nextGaussian(); } if (Math.abs(d) < 0.6745)...

matlab,plot,statistics,outliers

This is a fairly general problem with lots of approaches, usually you will use some a priori knowledge of the underlying system to make it tractable. So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast...

I don't think the issue here really concerns options available with TimeGrouper, but rather, how you want to deal with uneven data. You basically have 4 options that I can think of: 1) Drop enough observations (at the start or end) such that you have a multiple of 2 years...

python,statistics,scipy,p-value

Re what happens here internally. Well, the Student t distribution is defined for dof > 0, at least in scipy.stats: http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.t.html. Hence a nan: In [11]: stats.t.sf(-11, df=10) Out[11]: 0.99999967038443183 In [12]: stats.t.sf(-11, df=-10) Out[12]: nan ...

python,numpy,statistics,mean,weighted

You actually have 2 different questions. How to make data discrete, and How to make a weighted average. It's usually better to ask 1 question at a time, but anyway. Given your specification: xmin = -100 xmax = 100 binsize = 20 First, let's import numpy and make some data:...

c#,linq,dictionary,io,statistics

I'd suggest creating a custom class which can hold/store related data. Let it be Statisztika with the following fields/properties: Day, PersonId, Visitor and CountOfVisits. Statisztika class definition: public class Statisztika { private int iday = 0; private int ipersonid = 0; private int ivisitor =0; private int icount =0; //class...

r,statistics,correlation,vegan

You want the anova() method that vegan provides for cca(), the function that does CCA in the package, if you want to test effects in a current model. See ?anova.cca for details and perhaps the by = "margin" option to test marginal terms. To do stepwise selection you have two...

On Windows: Create a batch file which will start both the Apache and Tomcat services one after another. See below: sc start MyService ...

python,statistics,linear-regression,statsmodels

A linear hypothesis has the form R params = q where R is the matrix that defines the linear combination of parameters and q is the hypothesized value. In the simple case where we want to test whether some parameters are zero, the R matrix has a 1 in the...

python,numpy,statistics,minitab

Here's an attempt to implement Minitab's algorithm. I've written these functions assuming that you've already dropped missing observations from the series a: # Drop missing obs x = df.aquatic[~ pd.isnull(df.aquatic)] def get_quartile1(a): a = a.sort(inplace=False) pos1 = (len(a) + 1) / 4.0 round_pos1 = int(np.floor((len(a) + 1) / 4.0)) first_part...

Let's see this through an example: First of all you need to specify another argument on the trainControl function, returnResamp='all' so that it returns info on all resamples. Example data: #classification example y <- rep(c(0,1), c(25,25)) x1 <- runif(50) x2 <- runif(50) df <- data.frame(y,x1,x2) Solution: Your code should be...

python,numpy,statistics,hdf5,h5py

In case anybody else stumbles across this: The way I solved this was to first extract all p-values that had a chance of passing the FDR correction threshold (I used 1e-5). Memory-consumption was not an issue for this, since I could just iterate through the list of p-values on disk....

python,statistics,scikit-learn,statsmodels,cvxopt

statsmodels has had for some time a fit_regularized for the discrete models including NegativeBinomial. http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit_regularized.html which doesn't have the docstring (I just saw). The docstring for Poisson has the same information http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.Poisson.fit_regularized.html and there should be some examples available in the documentation or unit tests. It uses an interior...

python,numpy,statistics,scipy,nested-lists

You need to apply it on a numpy.array reflecting the nested lists. from scipy import stats import numpy as np dataset = np.array([[1.5,3.3,2.6,5.8],[1.5,3.2,5.6,1.8],[2.5,3.1,3.6,5.2]]) stats.mstats.zscore(dataset) works fine....

Try this: > m <- lm(B~A) > predict(m, newdata=data.frame(A=14), interval='confidence', level=0.9) fit lwr upr 1 60.58495 54.72854 66.44135 ...

r,data,statistics,analytics,outliers

I believe that "outlier" is a very dangerous and misleading term. In many cases it means a data point which should be excluded from analysis for a specific reason. Such a reason could be that a value is beyond physical boundaries because of a measurement error, but not that "it...

Tabulating and collapsing Your example vector is vec <- letters[c(1,2,2,2,3,3,4,5,6)] To get a tabulation, use tab <- table(vec) To collapse infrequent items (say, with counts below two), use res <- c(tab[tab>=2],other=sum(tab[tab<2])) # b c other # 3 2 4 Displaying in two columns resdf <- data.frame(count=res) # count # b...
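
The tabulate-and-collapse step can also be sketched in Python with collections.Counter. This is just a translation of the R idea for illustration; the variable names mirror the R ones.

```python
from collections import Counter

# Same example vector as letters[c(1,2,2,2,3,3,4,5,6)] in R
vec = ["a", "b", "b", "b", "c", "c", "d", "e", "f"]

tab = Counter(vec)  # tabulation: item -> count

# Keep items seen at least twice, pool the rest under "other"
res = {item: count for item, count in tab.items() if count >= 2}
res["other"] = sum(count for count in tab.values() if count < 2)

for item, count in res.items():
    print(item, count)
```

This gives b 3, c 2 and other 4, matching the R result.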

As I stated in the comments: generate a random number from 0 to 1 with RAND and compare it with the probability. If it is greater than or equal to the probability, the result is 0; otherwise it is 1. =IF(RAND()>=A1,0,1) ...
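
The same comparison can be sketched in Python to see that it behaves as intended. The probability 0.3, the seed, and the function name bernoulli are arbitrary choices for the illustration.

```python
import random

def bernoulli(p, rng=random):
    """Return 1 with probability p, else 0 (mirrors =IF(RAND()>=A1,0,1))."""
    return 0 if rng.random() >= p else 1

rng = random.Random(42)  # fixed seed so the run is repeatable
draws = [bernoulli(0.3, rng) for _ in range(10000)]
share_of_ones = sum(draws) / len(draws)
print(share_of_ones)
```

Over many draws the share of ones should be close to the probability that was passed in.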

linux,unix,statistics,monitoring,proc

There surely is a .c file and it's a part of the Linux kernel. If you really want to see how it's done you can start unwinding it e.g. from here: http://lxr.free-electrons.com/source/block/genhd.c?v=3.8 Reading from procfs is not the worst method to get the stats, actually that's what it's made for....

You could use apply to generate a list of the Fisher test results: tests <- apply(data2, 1, function(x) fisher.test(matrix(na.omit(x), nrow=2, byrow=TRUE))) Then you could access row-specific tests with standard list indexing tests[[4]] # Fisher's Exact Test for Count Data # # data: matrix(na.omit(x), nrow = 2, byrow = TRUE) #...

r,statistics,mathematical-optimization,minimization

On this occasion optim will not work obviously because you have equality constraints. constrOptim will not work either for the same reason (I tried converting the equality to two inequalities i.e. greater and less than 15 but this didn't work with constrOptim). However, there is a package dedicated to this...

Here's an example which uses random-fu: import Data.Random -- for randomness import Text.Printf -- for printf import Data.Foldable -- for the for_ loop -- pdf and cdf are basically “Distribution -> Double -> Double” main = do -- defining normal distribution with mean = 10 and variation = 2 let...

r,statistics,normal-distribution

Since both T1 and T2 rely on X1, X2, Y1, and Y2, you should first simulate those four random variables: X1 <- rnorm(1e4, mu1, sigma) X2 <- rnorm(1e4, mu1, sigma) Y1 <- rnorm(1e4, mu2, sigma) Y2 <- rnorm(1e4, mu2, sigma) Then you can run your code to get all simulated...

Using IRanges, you should use findOverlaps or mergeByOverlaps instead of countOverlaps. By default, though, it doesn't return rows with no matches. I'll leave that to you. Instead, I'll show an alternate method using foverlaps() from the data.table package: require(data.table) subject <- data.table(interval = paste("int", 1:4, sep=""), start = c(2,10,12,25), end = c(7,14,18,28)) query...

This is not an example of a Lagrange multiplier, and the two equations are not equivalent. However, the paper doesn't claim this: the text states that formula (5) is "modified" to get formula (6). Using a Lagrange multiplier would lead to a coupled system of two equations. Note how formula...

r,statistics,cluster-analysis,k-means

The amount of variance explained is related to the two principal components calculated to visualize your data. This has nothing to do with the type of clustering algorithm or the accuracy of the algorithm that you're using (kmeans in this case). To understand how accurate your clustering algorithm is at...

statistics,spring-boot,ehcache

If you can afford to use a snapshot release of spring boot this feature is being added to 1.3.0. Right now you won't get that in 1.2.X

In the mstats module of scipy.stats, "missing values" are handled using a masked array. nan does not indicate a missing value. The following shows how you can convert your array y (which uses nan for missing values) into a masked array my: In [48]: x = np.arange(12) In [49]: y...

r,math,statistics,time-series,forecasting

You seem to be confused between modelling and simulation. You are also wrong about auto.arima(). auto.arima() does allow exogenous variables via the xreg argument. Read the help file. You can include the exogenous variables for future periods using forecast.Arima(). Again, read the help file. It is not clear at all...

r,statistics,classification,decision-tree,rweka

In order to prune the tree with PART you need to specify it in the control argument of the function: There is a complete list of the commands you can pass into the control argument here I quote some of the options here which are relevant to pruning: Valid options...

There are many ways. I would prefer via a data.table. First convert your data into a data.table: require(data.table) #tested in data.table 1.9.4 setDT(mydata) > mydata Fruit.Type Year Primary.Wgt Primary.Loss.PCT Retail.Wgt Retail.Loss.PCT 1: Oranges.F 1970 16.16 3 15.68 11.6 2: Oranges.F 1971 15.73 3 15.26 11.6 3: Oranges.F 1972 14.47 3...

The signature for the function is args(bpower) # function (p1, p2, odds.ratio, percent.reduction, n, n1, n2, alpha = 0.05) so if unnamed, the third parameter will be interpreted as the odds ratio. So yes, you made a mistake in your code. You should explicitly name your parameters to avoid this...

See these links: http://www.evanmiller.org/how-not-to-sort-by-average-rating.html http://www.evanmiller.org/bayesian-average-ratings.html http://www.evanmiller.org/ranking-items-with-star-ratings.html On closer inspection of the last link, he is just calculating the mean minus the standard error of the mean and using that for ranking. ...

variables,statistics,sas,conditional-statements,proc

I wouldn't create a separate data step as suggested by Alex A. That can be a bad habit to develop as, with large datasets, it can be extremely costly in terms of CPU. Rather, I would subset the Proc Means call but slightly differently from Alex A's suggestion since you...

machine-learning,statistics,linear-regression

So Linear Regression assumes your data is linear even in multiple dimensions. It won't be possible to visualize high dimensional data unless you use some methods to reduce the high dimensional data. PCA can do that but bringing it down to 2 dimensions won't be helpful. You should do Cross...

r,loops,statistics,data.frame,regression

Here is a quick rewrite of your code; this should give you what you are looking for. Assigning the value of each column to a variable is unnecessary since myData should be a data.frame, so you can access each column by its column name. rm(list=ls()) myData <-read.csv(file="C:/Users/Documents/myfile.csv",header=TRUE, sep=",") for(i in names(myData)) {...

algorithm,math,statistics,variance,standard-deviation

Given the forward formulas Mk = Mk-1 + (xk – Mk-1) / k Sk = Sk-1 + (xk – Mk-1) * (xk – Mk), it's possible to solve for Mk-1 as a function of Mk and xk and k: Mk-1 = Mk - (xk - Mk) / (k - 1)....
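
Here is a small Python sketch of the forward update and its inverse, to make the algebra concrete. The backward formula for S is obtained here by rearranging the forward one, S(k-1) = S(k) - (x(k) - M(k-1)) * (x(k) - M(k)); the function names and the handling of the k = 1 case are assumptions for the example.

```python
def add_sample(k, mean, s, x):
    """Forward step: include x, return the new (k, mean, s)."""
    k += 1
    new_mean = mean + (x - mean) / k
    s += (x - mean) * (x - new_mean)
    return k, new_mean, s

def remove_sample(k, mean, s, x):
    """Backward step: undo the inclusion of x, return the old (k, mean, s)."""
    if k <= 1:
        return 0, 0.0, 0.0  # removing the only sample leaves an empty state
    old_mean = mean - (x - mean) / (k - 1)
    s -= (x - old_mean) * (x - mean)
    return k - 1, old_mean, s

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0]
state = (0, 0.0, 0.0)
for value in data:
    state = add_sample(*state, value)

with_extra = add_sample(*state, 9.0)         # add one more sample
restored = remove_sample(*with_extra, 9.0)   # and take it out again
print(state)
print(restored)
```

After adding and then removing the same value, the count, mean and sum of squared deviations should match the earlier state up to floating point rounding. The population variance at any point is s / k.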

Weather Underground has free historical data. Here's data for Helsinki from January 1, 2015 through June 18, 2015 (for example). You can customize the data range and manually download as a CSV file.

javascript,algorithm,math,statistics,distribution

You mention a logarithmic distribution, but it looks like your code is designed to generate a truncated geometric distribution instead, although it is flawed. There is more than one distribution called a logarithmic distribution and none of them are that common. Please clarify if you really do mean one of...

Thanks to Roland You can get the model string by accessing the summary attributes (have a look at str(s)) s <- summary(fit) mymod <- paste(attr(s$terms, "term.labels"), collapse=" + ") mymod [1] "x1 + x2 + x1:x2" However, you can get the data by passing the model fit to model.matrix model.matrix(fit)...

Assuming your time were a variable "X", you can use round or trunc. Try: round(X, "hour") trunc(X, "hour") This would still require some work to determine whether the values had actually been rounded up or down (for round). So, if you don't want to have to think about that, you...

matlab,random,statistics,median

Matlab function rand generates (pseudo)-random numbers uniformly distributed on the interval [0,1]. The median of this distribution is 0.5. You can shift the median to any value medianValue by adding medianValue-0.5 to each number. The function function array = generateNumbers(m, n, medianValue) array = rand(m,n)-0.5 + medianValue; end returns a random...
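
A rough Python equivalent of the shift, using only the standard library, might look like this. It is an illustration of the idea rather than a port of the MATLAB function; the sizes, seed and target median are arbitrary.

```python
import random
import statistics

def generate_numbers(rows, cols, median_value, rng=random):
    """rows x cols uniform numbers whose distribution has the given median."""
    # rng.random() is uniform on [0, 1) with median 0.5, so shift by median_value - 0.5
    return [[rng.random() - 0.5 + median_value for _ in range(cols)]
            for _ in range(rows)]

rng = random.Random(0)  # fixed seed so the run is repeatable
array = generate_numbers(200, 50, 3.0, rng)
flat = [value for row in array for value in row]
sample_median = statistics.median(flat)
print(sample_median)
```

Note that the sample median is only approximately equal to the requested value; it is the underlying distribution whose median is exactly medianValue.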

Try mapply(function(x,y) t.test(x,y)$p.value, as.data.frame(controls), as.data.frame(patients)) # V1 V2 V3 V4 V5 V6 V7 V8 #0.8481788 1.0000000 0.4605294 1.0000000 0.6436604 1.0000000 1.0000000 1.0000000 # V9 V10 V11 #1.0000000 1.0000000 1.0000000 assuming that "controls" and "patients" are matrix data controls <- structure(c(1253, 2311.3, 1314.83, 9.88, 12.74, 11.39, 20.8, 6.82, 18.12, 17.88, 17.88,...

First, make good use of Stata's help files: e.g., search percentiles returns a list of possible commands. Two commands that will likely be of use are summarize (with the detail option; note that you can use return list afterwards to view/store results [regardless of whether the detail option was specified])...

python,statistics,scipy,p-value

The degrees of freedom you are passing to the formula are negative. In [6]: import numpy as np from scipy.special import stdtr dof = -2176568 tf = -11.374250 2*stdtr(dof, -np.abs(tf)) Out[6]: nan If positive: In [7]: import numpy as np from scipy.special import stdtr dof = 2176568 tf...

r,list,table,statistics,do.call

This produces your desired output if I understand it correctly: sink("output.txt") for (i in seq_along(z)) { cat(names(z)[i], '\n') # print out the header write.table(z[[i]], row.names = FALSE, col.names = FALSE) } sink() I open a connection to a text file with sink then loop over your list of tables and...

python,pandas,statistics,scipy

One way to calculate this is to use apply on the groupby object: >>> import scipy.stats as st >>> df.groupby(['station_id']).apply(lambda x: st.kendalltau(x['year'], x['Sum'])) station_id 210018 (-0.2, 0.62420612399) 215400 (0.4, 0.327186890661) dtype: object ...

Using dcast from reshape2 : library(reshape2) dcast(dat,V1~V2,fill=0) V1 1 2 3 4 1 pat1 2 0 1 2 2 pat2 0 0 3 0 3 pat3 4 3 0 0 Where dat is : dat <- read.table(text='V1 V2 V3 pat1 1 2 pat1 3 1 pat1 4 2 pat2 3...

python,statistics,canopy,pysal

The problem is that the regression results instance of statsmodels is not compatible with the one in pysal. You can use breushpagan from statsmodels, which takes OLS residuals and candidates for explanatory variables for the heteroscedasticity and so it does not rely on a specific model or implementation of a...

Please read the documentation carefully. help(arima) clearly tells you that init relates to the initial values of parameters: init optional numeric vector of initial parameter values. Missing values will be filled in, by zeroes except for regression coefficients. Values already specified in fixed will be ignored. Similarly, fixed also relates...

With help from @akrun and @eddi, here's the idiomatic (?) way: mycols = c("description","date","location") setkeyv(DT0,mycols) DT1 <- DT0[J(do.call(CJ,lapply(mycols,function(x)unique(get(x)))))] # alternately: DT1 <- DT0[DT0[,do.call(CJ,lapply(.SD,unique)),.SDcols=mycols]] The identifier column is missing for the new rows, but can be filled: setkey(DT1,description) DT1[unique(DT0[,c("description","identifier"),with=FALSE]),identifier:=i.identifier] ...

r,statistics,probability-density

The efficient way to do this sort of operation is to use a convolution: convolve(a, rev(b), type="open") # [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05 This is efficient both because it's less typing than computing each value individually and also because it's implemented in an efficient way (using the...
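
To show what the convolution computes, here is a direct Python sketch for the probability mass function of a sum of two independent discrete variables. This is the plain double loop, not an FFT-based routine, and the coin example is made up because the vectors a and b are not shown in the answer.

```python
def convolve_pmfs(a, b):
    """PMF of X + Y for independent X ~ a and Y ~ b (list index = value)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            # X = i and Y = j contribute to the outcome X + Y = i + j
            out[i + j] += pa * pb
    return out

coin = [0.5, 0.5]                      # P(0 heads), P(1 head)
two_coins = convolve_pmfs(coin, coin)  # distribution of heads in two tosses
print(two_coins)
```

The result has length(a) + length(b) - 1 entries, which is the same shape that convolve(a, rev(b), type="open") returns in R.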

r,functional-programming,statistics

The simplest approach would probably be to use the bigsplit function and a for loop for in-place modification. idx <- bigsplit(data, 1) for(i in seq(length(idx))){ data[idx[[i]],2] <- data[idx[[i]],2] - mean_var1[i] } It appears like you will want the former but if you wanted a subset returned of a reasonable size...

c#,algorithm,data,graph,statistics

A simple way is to calculate the difference between every two neighbouring samples, e.g. diff = abs(y[x point 1] - y[x point 0]), and calculate the standard deviation for all the differences. This will rank the differences in order for you and also help eliminate random noise which you get if...
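
A minimal Python sketch of this idea follows. The rule used to flag a jump (difference larger than the mean difference plus n_sigma standard deviations) is an assumption added for the example, as are the function names and the sample data.

```python
import statistics

def neighbour_differences(y):
    """Absolute difference between each pair of neighbouring samples."""
    return [abs(b - a) for a, b in zip(y, y[1:])]

def large_jumps(y, n_sigma=2.0):
    """Indices i where the step from y[i] to y[i + 1] is unusually large."""
    diffs = neighbour_differences(y)
    mean = statistics.mean(diffs)
    sd = statistics.pstdev(diffs)
    # assumption: "unusually large" means above mean + n_sigma * sd
    return [i for i, d in enumerate(diffs) if d > mean + n_sigma * sd]

y = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0]
print(large_jumps(y))
```

In this made-up series only the step between the fourth and fifth samples stands out from the small noise-level differences.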

r,statistics,bar-chart,pie-chart

Here is a general approach: pie has a labels parameter so you can just use that, use \n for a line break, and add text under the name barplot has a return value which are the x-coordinates of each bar, so just use text along with the data (for the...

statistics,mean,angle,direction

The values don't look right Angles that differ by 360 are equivalent. So, -28.8551147 == 331.145, which is the arithmetic mean of the two values you provided. If you would like to ensure that your values are always in [0,360), you should add 360 if the values are less...
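
The equivalence can be checked with a short Python sketch that maps any angle onto [0, 360). The function name is hypothetical, and the guard against a result of exactly 360.0 is there only because floating point modulo can round up to the modulus for tiny negative inputs.

```python
def normalize_degrees(angle):
    """Map an angle in degrees onto the equivalent angle in [0, 360)."""
    result = angle % 360.0  # Python's % returns a non-negative result here
    return 0.0 if result == 360.0 else result

print(normalize_degrees(-28.8551147))
print(normalize_degrees(725.0))
```

Applied to -28.8551147 this gives approximately 331.1448853, which is the same direction expressed as a positive angle.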

java,arrays,methods,statistics

There are two errors. First: when you call test.findMin(num) you are trying to pass the parameter num. But num is not an array! It is a number. You probably want to do this: test.findMin(array) Second: you can implicitly convert an integer to a double, because you can be...

This is due to the use of a "multi-step" recursion like asymmetricP05. Such a pattern allows the warping path to be composed of long segments, e.g. knight's moves. To verify the monotonicity, you should only consider the starting positions of each of the "knight's moves" - not all of the...

You can explicitly pass the value to the function by defining the FUN=function() value. Using the mtcars dataset library(pastecs) Note that the first line is an abbreviated version of the second by(mtcars$mpg, mtcars$am, stat.desc, norm = TRUE, basic = TRUE) by(mtcars$mpg, mtcars$am, function(X) stat.desc(X, norm = TRUE, basic = TRUE))...

r,machine-learning,statistics,classification

Basically, your question boils down to having some variables (Word1, Word2, and Word3 in your example) and a binary outcome (Author in your example) and wanting to know the importance of different variables in determining that outcome. A natural approach would be training a regression model to predict the outcome...

statistics,standard-deviation,weighted-average

I just found this wikipedia page discussing data of equal significance vs weighted data. The correct way to calculate the biased weighted estimator of variance is $\sigma^2 = \sum_i w_i (x_i - \mu^*)^2 / \sum_i w_i$, where $\mu^*$ is the weighted mean, though this on-the-fly implementation is more efficient computationally as it does not require calculating the weighted average before looping over the sum on...
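
For illustration, the biased weighted variance (the weighted mean of squared deviations from the weighted mean) can be sketched in Python as below. This is the straightforward two-pass version, not the on-the-fly implementation mentioned above, and the function name and sample numbers are made up.

```python
def weighted_mean_and_variance(values, weights):
    """Weighted mean and biased weighted variance of values."""
    total_weight = sum(weights)
    # First pass: weighted mean
    mean = sum(w * x for x, w in zip(values, weights)) / total_weight
    # Second pass: weighted average of squared deviations from that mean
    variance = sum(w * (x - mean) ** 2
                   for x, w in zip(values, weights)) / total_weight
    return mean, variance

mean, variance = weighted_mean_and_variance([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
print(mean, variance)
```

With equal weights this reduces to the ordinary population variance, which is a quick sanity check for any implementation.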

Inverse scale (1/scale) is rate parameter. So if you have shape and rate you can create gamma rv with this code >>> from scipy.stats import gamma >>> rv = gamma(shape, scale = 1.0/rate) Read more about different parametrizations of Gamma distribution on Wikipedia: http://en.wikipedia.org/wiki/Gamma_distribution...

You are mistaking what the significance means in terms of the p-value. I will try to explain below: Let's assume a test about the means of two populations being equal. We will perform a t-test to test that by drawing one sample from each population and calculating the p-value. The...

statistics,genetic-algorithm,prediction,generalization

The problem with over-fitting is that, within a single data-set it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines: A GA will learn to do exactly...

r,plot,statistics,confidence-interval

So, the hard part of this is transforming your data into the right shape, which is why it's nice to share something that really looks like your data, not just a single column. Let's say your data is a matrix with 10,000 rows and 10 columns. I'll just use...

matlab,statistics,matlab-guide,matlab-deployment

That scaling comes from linear algebra. That's what we call normalizing by producing a unit vector. Assuming that each row is an observation and each column is a feature, what's happening here is that we are going through every observation that you collected and normalizing each feature value over all...