gbutils overview
Brief description of programs
The programs in gbutils can be divided into four broad classes:
- Data Manipulation
- Data Transformation
- Descriptive Statistics
- Statistical Tests and Models
The basic operation is essentially the same for all programs: you feed the standard input of the program with data in ASCII format separated by spaces, tabs or newline characters. In general, each input line is considered a record and the blank-separated entries in each line are considered different fields. The exact way in which different records and fields are treated depends on the program and can vary according to the options specified on the command line (see below).
After the program has read the data from standard input, it performs the required manipulations/analyses and prints the result to standard output, in the form of an ASCII file of newline-separated records. Inside each output record, the fields are separated by spaces. The meaning of the records and fields depends on the program.
Data Manipulation
These programs do not perform any analysis by themselves. Rather, they
are provided as a help to prepare data for subsequent analysis. In
particular, gbget is the only program that reads data from files and
not from standard input. It can be used to extract data, according to
a given pattern, from one or more files and send them, through a pipe
(|), to other utilities. This program has a rather complex set
of options. See README.gbget for a tutorial on its use.
- gbget
- extract data from a tabular input according to a specified pattern. It is possible to access more files at the same time, merge their contents and transpose or flatten the resulting table.
- gbfun
- compute generic functions on data in a column-wise manner. The function can be applied to all the columns or defined in a recursive way.
- gbgrid
- generate a grid (i.e. a matrix) of values according to a user specified function
- gbboot
- generate bootstrapped sequences from a user-provided data sample
- gbrand
- generate i.i.d. pseudo-random variates
- gbenv
- provide information about the numeric environment and the internal settings of the package
Data transformation
These programs perform basic transformation on input data, which are often considered preliminary to further statistical analysis.
- gbmave
- print moving statistics (average, variance, etc.) of input data
- gbinterp
- compute the interpolation of user-provided points on a regular mesh. It can also print the first and second derivatives of the interpolation.
- gbfilternear
- filter near points in the Euclidean metric. Points whose distance is below a given threshold are removed.
Descriptive statistics
These utilities are useful in the representation and description of data. They encompass simple statistics and more "advanced" non parametric methods.
- gbdist
- cumulative distribution of input data
- gbstat
- simple descriptive statistics of input data
- gbbin
- compute binned statistics
- gbquant
- quantiles of the empirical distribution of input data
- gbhisto
- histogram for univariate data. Choose between absolute frequencies, relative frequencies and empirical density
- gbker
- kernel density estimate for univariate data. The type of kernel, the bandwidth and the computation method can be specified at the command line
- gbnear
- density estimate via nearest neighbors method
- gbker2d
- kernel density estimate for bivariate data
- gbhisto2d
- histograms for bivariate data
- gbgcorr
- Compute the correlation dimension of a time series with a Gaussian kernel.
- gbacorr
- It computes the autocorrelogram or the cross-autocorrelogram of a series of observations. It reads the data column-wise.
- gbxcorr
- Compute the cross-covariance and cross-correlation coefficients with and without the removal of the mean of two samples. It reads the data column-wise.
Statistical tests and models
These utilities provide statistical tests to compare different samples and non-parametric methods to investigate relationships between paired (or, in general, compounded) observations.
- gbtest
- various one- and two-sample statistical tests. When available, the p-score significance is also provided.
- gbmodes
- find the critical bandwidth for a kernel density estimate to generate a given number of modes and compute the associated p-value using smoothed bootstrap technique
- gbbin
- take couples of values X Y (separated by spaces), bin them with respect to the first variable and print statistics of the second variable
- gbkreg
- compute the kernel non-linear regression function
- gbkreg2d
- compute the kernel non-linear regression function on three dimensional data
- gblreg
- compute linear OLS regression
- gbglreg
- compute generalized linear OLS regression
- gbnlreg
- compute non linear regression using OLS, MAD or asymmetric MAD estimators
- gbnlqreg
- compute non linear quantile regression
- gbnlmult
- contemporaneous least squares estimation of a system of non linear equations.
- gbnlprobit
- estimate a non linear probit model on binary data
- gbhill
- estimate different families of probability distributions on the extremal data using maximum likelihood.
- gblafit
- Fit Laplace density using maximum-likelihood.
- gbalafit
- Fit asymmetric Laplace density using maximum-likelihood.
- gbepfit
- Fit symmetric power exponential density using maximum-likelihood.
- gblaepfit
- Fit skewed power exponential density using maximum-likelihood.
- gbaepfit
- Fit asymmetric power exponential density using maximum-likelihood.
For more information on a specific command, use the -h command line option.
Please note that all the programs load the whole dataset into memory before computing the relevant statistics. For this reason, they are probably not suitable for very large datasets.
Understanding Input/Output
All the commands of this package read input in ASCII format. The data
should be separated by whitespace characters (spaces or tabs) or
newlines. Lines beginning with a hash symbol # are ignored: they
are simply skipped by the input routine.
If support for zlib has been included at compile time (see above), the input ASCII file can be gz-compressed.
A file can contain several blocks of data. Blocks are separated by two consecutive blank lines. In general, all operations are performed on the first block found in the datafile. The program 'gbget' can be used to extract one particular block (or set of blocks) from one file.
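The block convention can be illustrated with a short Python sketch. This is not the package's own reader: the function name `split_blocks` is hypothetical, and the treatment of comment lines and of single blank lines (both simply skipped here) is an assumption.

```python
def split_blocks(text):
    """Split text into blocks separated by two or more consecutive blank lines."""
    blocks, current, blanks = [], [], 0
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # assumption: comment lines are skipped and never count as blank
        if not line.strip():
            blanks += 1
            continue
        # Two or more blank lines before this record close the previous block.
        if blanks >= 2 and current:
            blocks.append(current)
            current = []
        blanks = 0
        current.append(line.split())
    if current:
        blocks.append(current)
    return blocks


sample = "1 2\n3 4\n\n\n5 6\n"
first_block = split_blocks(sample)[0]  # the block most programs would operate on
```

In this sketch a single blank line does not start a new block; only two or more consecutive blank lines do.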
Sequential, tabular and compounded input
The utilities in this package use three different ways of reading data from input:
- sequential
- In 'sequential' format a single dataset is internally built from the data input file. All the entries found on one column of input are read sequentially and put in the same dataset. Notice that the different lines must contain the same number of entries or NAN values are generated.
- tabular
- In 'tabular' format, each column of the input is treated as a different dataset; the program will internally create a list of datasets, one for each column of input. The different entries on one line are then put inside different sets. Notice that, in this case, the number of fields in the first non-comment non-empty line decides the number of datasets. All subsequent input lines should contain the same number of fields (but see below).
- compounded
- In 'compounded' format the program reads a fixed number of fields from each line. Each line is internally stored as an n-tuple (a couple or a triplet) and treated accordingly. Notice that if some line contains more fields than needed, the extra fields are ignored.
An example can clarify the difference between the 'sequential', 'tabular' and 'compounded' formats. Suppose we have the following input datafile
1.0 2.0 3.0
4.0 5.0 6.0
7.0 8.0 9.0
Using the 'sequential' input, the single dataset
{1.0,2.0,3.0,4.0,...,9.0} is internally generated by the
program. Notice the ordering: all the entries of one line are inserted
in the internal dataset before the next line is read.
In 'tabular' format, the program builds instead three different
datasets: {1.0,4.0,7.0}, {2.0,5.0,8.0} and {3.0,6.0,9.0}, and uses
each set separately for its subsequent duties (typically reproducing
the same statistical analysis for each set).
In 'compounded' format, assuming that the program accepts couples, the
following array of ordered couples is generated:
{(1.0,2.0),(4.0,5.0),(7.0,8.0)}. Notice that this is a single
dataset, made of couples of associated values.
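The three reading modes can be mimicked with a few lines of Python. This is only a sketch of the behaviour described above, not the package's implementation; the function names are hypothetical, and ignoring extra fields in 'tabular' mode is an assumption.

```python
import math


def read_rows(text):
    """Parse non-empty, non-comment lines into lists of floats."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        rows.append([float(tok) for tok in stripped.split()])
    return rows


def sequential(rows):
    """One dataset: each line is consumed entirely before the next one."""
    return [value for row in rows for value in row]


def tabular(rows):
    """One dataset per column; the first line fixes the number of columns."""
    ncols = len(rows[0])
    return [[row[j] if j < len(row) else math.nan for row in rows]
            for j in range(ncols)]


def compounded(rows, n=2):
    """One dataset of n-tuples; extra fields are ignored, missing ones become NaN."""
    return [tuple((row + [math.nan] * n)[:n]) for row in rows]


rows = read_rows("1.0 2.0 3.0\n4.0 5.0 6.0\n7.0 8.0 9.0\n")
```

With the example datafile, `sequential(rows)` gives the nine values in reading order, `tabular(rows)` gives the three column datasets, and `compounded(rows)` gives the three couples listed above.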
When available, the "sequential" format is the default, while the "tabular" format is activated with the option '-t'. See the Programs summary table for the list of input formats accepted by the different utilities.
Missing values and NaN management
When the conversion of an input entry to an internal floating point number cannot be performed, or when, on an input line, there are not enough values for the required "tabular" or "compounded" format, a NAN (not-a-number) value is generated. This approach is introduced to make possible the manipulation of files with an uneven number of entries in different columns or with "non numerical" values.
The following utilities automatically remove the NaN values from their
input: gbstat, gbdist and gbquant.
The other utilities handle NaN values as expected: if NaN values
are present, they typically return NaN output. In this case, the option
D of the gbget utility is provided to remove all the lines
containing NaN entries. This program can be used in a pipe like
...| gbget '()D' | ....
to treat the data before passing them to other NaN-sensitive
utilities.
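The NaN rules of this section can be sketched in Python as follows. The helper names are hypothetical and the code only imitates the described behaviour: unparsable or missing entries become NaN, and `drop_nan_rows` plays the role that the text assigns to the D option of gbget.

```python
import math


def parse_field(token):
    """Convert one entry; anything that is not a number becomes NaN."""
    try:
        return float(token)
    except ValueError:
        return math.nan


def parse_line(line, nfields):
    """Read exactly nfields values, padding short lines with NaN."""
    values = [parse_field(tok) for tok in line.split()[:nfields]]
    values += [math.nan] * (nfields - len(values))
    return values


def drop_nan_rows(rows):
    """Keep only the rows without any NaN entry."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]


rows = [parse_line(line, 3) for line in ["1 2 3", "4 x 6", "7 8"]]
clean = drop_nan_rows(rows)
```

Here the second line contains a non-numerical entry and the third one is too short, so only the first row survives the filtering step.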
Radix and thousands separator
In addition to the radix symbol, which separates the integer and the fractional part of a number, data are sometimes reported with a thousands separator symbol. For instance "one million" could be written "1,000,000.00". The characters used to separate thousands and the fractional part are defined by the locale settings of the C library. Programs in the gbutils package can automatically recognize the locale settings and process these entries accordingly. Please use "gbenv" to see the definitions in use. Changing the locale typically amounts to redefining the LANG environment variable
# export LANG="en_US"
A list of the available locales can be obtained with the locale program
# locale -a
and the actual setting verified with
# locale
For more details refer to the locale documentation.
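To make the role of the two symbols concrete, the following Python sketch converts an entry given explicit separator characters. It does not query the system locale as the gbutils programs are described to do; `parse_number` and its parameters are hypothetical names used only for illustration.

```python
def parse_number(token, thousands=",", radix="."):
    """Drop the thousands separator, then normalise the radix symbol to '.'."""
    return float(token.replace(thousands, "").replace(radix, "."))


us_style = parse_number("1,000,000.00")                # separators as in en_US
eu_style = parse_number("1.000.000,00", thousands=".", radix=",")
```

Both calls describe the same value, one million, written under two different conventions.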
Output format and precision
In general, the output from the different programs is made of newline
separated records of space separated fields of standard ASCII
characters, which represent floating point numbers. The default format
is scientific notation with a precision of six digits. The format and
the precision can be changed using the environment variable
GB_OUT_FLOAT_FORMAT. This variable can be set to any string that is
meaningful to printf (the standard C library function). For instance, with
# export GB_OUT_FLOAT_FORMAT="%.8e"
the precision is extended to eight digits, while with
# export GB_OUT_FLOAT_FORMAT="%f"
the scientific notation is replaced with a fixed-point
notation. Please refer to the printf documentation for further
details.
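The effect of different format strings can be previewed in Python, whose `%` operator follows the printf conventions for these conversions. The helper `format_record` is hypothetical, and the fallback `"%e"` is an assumption based on the stated default (scientific notation with six digits).

```python
import os


def format_record(values, fmt=None):
    """Format one output record as space separated fields."""
    if fmt is None:
        # Assumed default: scientific notation with six digits of precision.
        fmt = os.environ.get("GB_OUT_FLOAT_FORMAT", "%e")
    return " ".join(fmt % v for v in values)


record = [1234.56789, 0.5]
scientific = format_record(record, "%e")    # '1.234568e+03 5.000000e-01'
extended = format_record(record, "%.8e")    # '1.23456789e+03 5.00000000e-01'
fixed = format_record(record, "%f")         # '1234.567890 0.500000'
```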
There is also a second variable, GB_OUT_EMPTY_FORMAT, which can be
used to tune the comment headings that many programs generate with the
verbose option -v. Notice that it is automatically set to a value
which is consistent with the float format chosen, so in general it is
a good idea not to change it explicitly.
Functional expressions
The functional expressions used in various utilities, like gbget,
gbnlreg, or gbfun, are interpreted via the matheval library.
Supported constants are (names that should be used are given in
parentheses): e (e), log2(e) (log2e), log10(e) (log10e), ln(2)
(ln2), ln(10) (ln10), pi (pi), pi/2 (pi_2), pi/4 (pi_4),
1/pi (1_pi), 2/pi (2_pi), 2/sqrt(pi) (2_sqrtpi), sqrt(2)
(sqrt2) and sqrt(1/2) (sqrt1_2).
Supported elementary functions are (names that should be used are
given in parentheses): exponential (exp), logarithmic (log),
square root (sqrt), sine (sin), cosine (cos), tangent (tan),
cotangent (cot), secant (sec), cosecant (csc), inverse sine
(asin), inverse cosine (acos), inverse tangent (atan), inverse
cotangent (acot), inverse secant (asec), inverse cosecant
(acsc), hyperbolic sine (sinh), hyperbolic cosine (cosh), hyperbolic
tangent (tanh), hyperbolic cotangent (coth), hyperbolic secant
(sech), hyperbolic cosecant (csch), hyperbolic inverse sine
(asinh), hyperbolic inverse cosine (acosh), hyperbolic inverse
tangent (atanh), hyperbolic inverse cotangent (acoth), hyperbolic
inverse secant (asech), hyperbolic inverse cosecant (acsch),
absolute value (abs), Heaviside step function (step) with value 1
defined for x = 0, Dirac delta function with infinity (delta) and
not-a-number (nandelta) values defined for x = 0, and error function
(erf).
Supported unary operation is unary minus ('-').
Supported binary operations are addition ('+'), subtraction ('-'), multiplication ('*'), division ('/') and exponentiation ('^').
The usual mathematical rules regarding operator precedence apply. Parentheses ('(' and ')') can be used to change the priority order.
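The three less common functions, step, delta and nandelta, can be paraphrased in Python as below. Only the values at x = 0 come from the description above; the value 0 elsewhere (and 1 for positive arguments of step) follows the usual mathematical convention, and passing NaN arguments through unchanged is an assumption.

```python
import math


def step(x):
    """Heaviside step function, with value 1 at x = 0."""
    if math.isnan(x):
        return x  # assumption: NaN arguments propagate
    return 0.0 if x < 0 else 1.0


def delta(x):
    """Dirac delta represented with an infinite value at x = 0."""
    if math.isnan(x):
        return x
    return math.inf if x == 0 else 0.0


def nandelta(x):
    """Dirac delta represented with a NaN value at x = 0."""
    if math.isnan(x):
        return x
    return math.nan if x == 0 else 0.0


# In the expressions of this section exponentiation is written x^2;
# the Python equivalent is x**2.
values = [step(v) * v**2 for v in (-2.0, 0.0, 3.0)]
```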
Numerical Error handling
The default behaviour of GNU Scientific Library (GSL) functions is to abort
the execution of the program if a numeric error is produced. Some of
these errors, especially underflow errors, are tolerable inside a
computation. The 'gbutils' package provides a way of switching off the
GSL error handling. It is sufficient to set the environment variable
GB_ERROR_HANDLER_OFF using
# export GB_ERROR_HANDLER_OFF=
and all the programs will ignore numerical errors. This feature must be used carefully, after checking that the loss of precision implied by the presence of these errors can be considered tolerable for the actual computation one wants to perform. The default behaviour can be recovered using
# unset GB_ERROR_HANDLER_OFF
Binary format
THIS IS AN EXPERIMENTAL FEATURE
Like ASCII files, the binary files are structured as sequences of separate blocks. Each block is made of
- one size_t with the number of columns, C
- C size_t values with the length of the rows, R1 … RC
- the data stored sequentially column by column, for a total number equal to R1+R2+…+RC
This structure allows the storage of non-rectangular (non-matrix) structures in binary format. If the lengths are different, the missing values are replaced with NANs. This mimics the behaviour of ASCII data handling.
Notice that blocks are simply written one after the other. No particular separators are inserted between them.
Implementation: the option -b redefines the function used to read
and/or write data.
This feature has been implemented for gbget, gbmstat and gbfun.
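The block layout described in this section can be sketched with Python's `struct` module. Several points are assumptions, not statements about the package: the values are taken to be C doubles, the native byte order and native size_t width are used, R1 … RC are read as the number of entries in each column, and the function names are hypothetical.

```python
import io
import math
import struct


def write_block(stream, columns):
    """Write C, then R1..RC, then the data column by column."""
    stream.write(struct.pack("N", len(columns)))          # 'N' is a native size_t
    stream.write(struct.pack("%dN" % len(columns), *[len(c) for c in columns]))
    for column in columns:
        stream.write(struct.pack("%dd" % len(column), *column))  # assumed doubles


def read_block(stream):
    """Read one block and pad the shorter columns with NaN."""
    size = struct.calcsize("N")
    (ncols,) = struct.unpack("N", stream.read(size))
    lengths = struct.unpack("%dN" % ncols, stream.read(size * ncols))
    columns = [list(struct.unpack("%dd" % r, stream.read(8 * r))) for r in lengths]
    nrows = max(lengths, default=0)
    return [c + [math.nan] * (nrows - len(c)) for c in columns]


buffer = io.BytesIO()
write_block(buffer, [[1.0, 2.0, 3.0], [4.0]])
buffer.seek(0)
columns = read_block(buffer)
```

Since blocks carry their own sizes, several of them can be written back to back and read in sequence by calling `read_block` repeatedly, which matches the remark that no separators are inserted.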
Graphic output
As previously mentioned, the output of many programs in the gbutils package, like gbhisto or gbker, is intended to be plotted and not directly read from the terminal. It is generally composed of records and fields of standard ASCII characters. This type of output can be displayed using the various plotting utilities commonly available in Unix systems. We briefly review three possibilities below.
GNU plotutils package
The GNU plotutils package contains the program graph, which generates
a plot starting from input data. For example, to
obtain a plot of the kernel density of the data in file datafile.dat
one can use
gbker < datafile.dat | graph -T x
where -T x chooses the X Window System as the output device.
Gnuplot interactive session
An alternative is to use the powerful plotting environment provided by gnuplot.
From inside a gnuplot session, the previous kernel density can be obtained with
plot "< gbker < datafile.dat "
See the gnuplot documentation for details.
Gnuplot's plot from command line
As the example above shows, in order to plot the output of a command
inside gnuplot you need to put it inside an expression delimited by "<
and ". Moreover, all double quotes " inside the command have to be escaped with a
backslash \. These requirements can lead to cumbersome expressions
when complicated commands are necessary. In addition, starting an
interactive gnuplot session and writing the expression whose output
should be plotted is not very attractive when one needs fast,
simple plotting for exploratory purposes. These are the cases for
which the command gbplot is provided. This is a shell script that
accepts the data to be plotted as input and the directives on how to
plot them on the command line.
The basic usage is as follows
gbplot [options] [plot|splot] <plotting options> < datafile
or
command pipe | gbplot [options] [plot|splot] <plotting options>
One of the commands plot or splot is required. One can provide further
plotting options by inserting them after the command. For example,
one can plot the kernel density estimate using
gbker < datafile.dat | gbplot plot
In this way the density is plotted using simple points. To use gnuplot's fancier 'histeps' style, use instead
gbker < datafile.dat | gbplot plot with histeps
The syntax of the plotting options is exactly the same that would be
used inside gnuplot after the plot or splot command. For
instance, to specify a range for the x values use
gbker < datafile.dat | gbplot plot '[-1:1]' with histeps
It is also possible to obtain multiple plots of the data using the gnuplot special file name '""', as in
gbker < datafile.dat | gbplot plot 'w p , "" w l'
This command draws the kernel estimate two times: the first with
points, the second with a line (as specified by the w l expression).
gbplot also has several options. They must be specified before
the plot or splot command. To insert a title in the plot use the
option -t
gbker < datafile.dat | gbplot -t Title plot with histeps
Terminal type and output file can be specified with the -T and -o
options respectively. The command
gbker < datafile.dat | gbplot -T pdf -o output.pdf plot with histeps
produces a PDF version of the plot and saves it in 'output.pdf'.
Finally, if an interactive manipulation of plot parameters or data is
required, you can use the option -i. This option opens an
interactive gnuplot session, allowing for direct manipulation of plot
settings and parameters
gbker < datafile.dat | gbplot -i plot with histeps
Once the session is closed, the output is saved in a file using a
specific terminal if options -o and -T have been specified.
Programs summary table
Name | Input Type | External lib | NAN |
---|---|---|---|
gbget | c+ | (matheval) | * |
gbfun | c+ | matheval | |
gbgrid | no | matheval | |
gbrand | no | gsl | |
gbboot | s,t,c+ | (gsl) | |
gbenv | no | | |
gbmave | s,t | | * |
gbinterp | c2 | gsl | |
gbfilternear | c+ | | |
gbdist | s,t | | * |
gbstat | s,t | | * |
gbquant | s,t,c+ | | * |
gbhisto | s,t | | |
gbker | s | gsl | |
gbnear | s | | |
gbhisto2d | c2 | | |
gbgcorr | s | | |
gbacorr | c1,c2 | | |
gbxcorr | c2 | | |
gbker2d | c2 | | |
gbbin | c+ | | |
gbtest | c+ | (gsl) | * |
gbmodes | s | gsl | |
gbbin | t | | |
gbkreg | c2 | gsl | |
gbkreg2d | c3 | | |
gblreg | c2 | gsl | |
gbglreg | c+ | gsl | |
gbnlreg | c+ | gsl,matheval | |
gbnlqreg | c+ | gsl,matheval | |
gbhill | s | gsl | |
gbnlmult | c+ | gsl,matheval | |
gbnlprobit | c+ | gsl,matheval | |
gbnlpanel | c+ | gsl,matheval | * |
gbnlpolyit | c+ | gsl,matheval | |
gblafit | s | gsl | |
gbalafit | s | gsl | |
gbepfit | s | gsl | |
gblaepfit | s | gsl | |
gbaepfit | s | gsl | |
Input Type: 's' sequential; 't' tabular; 'c' compounded ('c2' reads couples, 'c3' triplets, 'c+' a variable number of columns); 'no' no input required
External lib: 'gsl' GNU Scientific Library; 'matheval' GNU matheval library; '()' means optional dependence (special features are available only if the library is found)
NAN: '*' means the program automatically ignores NAN values in computations