Home > Archive > PostgreSQL Patches > October 2006 > array_accum aggregate









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author array_accum aggregate
Stephen Frost

2006-10-25, 8:29 am

Greetings,

Please find below a patch to add the array_accum aggregate as a
built-in using two new C functions defined in array_userfuncs.c.
These functions simply expose the pre-existing efficient array
building routines used elsewhere in the backend (accumArrayResult
and makeArrayResult, specifically). An array_accum aggregate has
existed in the documentation for quite some time using the
inefficient (for larger arrays) array_append routine. The
documentation around the example has also been updated to reflect
the addition of this built-in.

Documentation and a regression test are also included.

Thanks,

Stephen

Index: doc/src/sgml/func.sgml
====================
====================
====================
=======
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/func.sgml,v
retrieving revision 1.343
diff -c -r1.343 func.sgml
*** doc/src/sgml/func.sgml 1 Oct 2006 18:54:31 -0000 1.343
--- doc/src/sgml/func.sgml 11 Oct 2006 04:38:45 -0000
***************
*** 7851,7856 ****
--- 7851,7872 ----
<row>
<entry>
<indexterm>
+ <primary>array_accum</primary>
+ </indexterm>
+ <function>array_accum(<replaceable class="parameter">anyelement</replaceable> )</function>
+ </entry>
+ <entry>
+ <type>anyelement</type>
+ </entry>
+ <entry>
+ array of elements of same type as argument type
+ </entry>
+ <entry>an array of all input elements (NULLs, non-nulls, and duplicates)</entry>
+ </row>
+
+ <row>
+ <entry>
+ <indexterm>
<primary>average</primary>
</indexterm>
<function>avg(<replaceable class="parameter">expression</replaceable> )</function>
Index: doc/src/sgml/xaggr.sgml
====================
====================
====================
=======
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/xaggr.sgml,v
retrieving revision 1.33
diff -c -r1.33 xaggr.sgml
*** doc/src/sgml/xaggr.sgml 16 Sep 2006 00:30:16 -0000 1.33
--- doc/src/sgml/xaggr.sgml 11 Oct 2006 04:38:45 -0000
***************
*** 132,138 ****
</programlisting>

Here, the actual state type for any aggregate call is the array type
! having the actual input type as elements.
</para>

<para>
--- 132,141 ----
</programlisting>

Here, the actual state type for any aggregate call is the array type
! having the actual input type as elements. Note: array_accum() is now
! a built-in aggregate which uses a much more efficient mechanism than
! that which is provided by array_append, prior users of array_accum()
! may be pleasantly suprised at the marked improvment for larger arrays.
</para>

<para>
Index: src/backend/utils/adt/array_userfuncs.c
====================
====================
====================
=======
RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/array_userfuncs.c,v
retrieving revision 1.20
diff -c -r1.20 array_userfuncs.c
*** src/backend/utils/adt/array_userfuncs.c 14 Jul 2006 14:52:23 -0000 1.20
--- src/backend/utils/adt/array_userfuncs.c 11 Oct 2006 04:38:46 -0000
***************
*** 15,20 ****
--- 15,22 ----
#include "utils/array.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
+ #include "utils/memutils.h"
+ #include "nodes/execnodes.h"


/*-----------------------------------------------------------------------------
***************
*** 399,404 ****
--- 401,516 ----
PG_RETURN_ARRAYTYPE_
P(result);
}

+ /* Structure, used by aaccum_sfunc and aaccum_ffunc to
+ * implement the array_accum() aggregate, for storing
+ * pointers to the ArrayBuildState for the array we are
+ * building and the MemoryContext in which it is being
+ * built. Note that this structure is
+ * considered an 'anyarray' externally, which is a
+ * variable-length datatype, and therefore
+ * must open with an int32 defining the length. */
+ typedef struct {
+ int32 vl_len;
+ ArrayBuildState *a
state;
+ MemoryContext arrctx;
+ } aaccum_info;
+
+ /*-----------------------------------------------------------------------------
+ * aaccum_sfunc :
+ * State transistion function for the array_accum() aggregate,
+ * efficiently builds an in-memory array by working in blocks and
+ * minimizing realloc()'s and copying of the data in general.
+ * Creates a seperate memory context attached to the AggContext into
+ * which the array is built. That context is free'd when the final
+ * function is called (aaccum_ffunc). accumArrayResult() does all
+ * the heavy lifting here, this is really just a glue function.
+ *----------------------------------------------------------------------------
+ */
+ Datum
+ aaccum_sfunc(PG_FUNC
TION_ARGS)
+ {
+ aaccum_info *ainfo
;
+ AggState *aggstate
;
+
+ /* Make sure we are being called in an aggregate. */
+ if (!fcinfo->context || !IsA(fcinfo->context, AggState))
+ ereport(ERROR,
+ (errcode(ERRCODE
_FEATURE_NOT_SUPPORT
ED),
+ errmsg("Can not call aaccum_sfunc as a non-aggregate"),
+ errhint("Use the array_accum aggregate")));
+
+ aggstate = (AggState*) fcinfo->context;
+
+ /* Initial call passes NULL in for our state variable.
+ * Allocate memory to store the pointers in and create
+ * our context. */
+ if (PG_ARGISNULL(0)) {
+ /* Allocate memory to hold the pointers to the ArrayBuildState
+ * and the MemoryContext where we are building the array. Note
+ * that we can do this in the CurrentMemoryContext
because when
+ * we return the storage "bytea" will be copied into the AggState
+ * context by the caller and passed back to us on the next call. */
+ ainfo = (aaccum_info*) palloc(sizeof(aaccum
_info));
+ ainfo->vl_len = sizeof(aaccum_info);

+ ainfo->astate = NULL;
+
+ /* New context created which will store our array accumulation.
+ * The parent is the AggContext for this query since it needs to
+ * persist for the same timeframe as the state value.
+ * The state value holds the pointers to the ArrayBuildState and this
+ * MemoryContext through the aaccum_info structure. */
+ ainfo->arrctx = AllocSetContextCreat
e(aggstate->aggcontext, "ArrayAccumCtx",
+ ALLOCSET_DEFAULT_MIN
SIZE,
+ ALLOCSET_DEFAULT_INI
TSIZE,
+ ALLOCSET_DEFAULT_MAX
SIZE);
+ } else {
+ /* Our state variable is non-null, therefore it must be an existing
+ * ainfo structure. */
+ ainfo = (aaccum_info*) PG_GETARG_BYTEA_P(0)
;
+ }
+
+ /* Pull the element to be added and pass it along with the ArrayBuildState
+ * and ArrayAccumCtx MemoryContext to accumArrayResult, checking if it is
+ * NULL or not. */
+ ainfo->astate = accumArrayResult(ain
fo->astate,
+ PG_ARGISNULL(1) ? (Datum) 0 : PG_GETARG_DATUM(1),
+ PG_ARGISNULL(1),
+ get_fn_expr_argtype(
fcinfo->flinfo, 1),
+ ainfo->arrctx);
+
+ /* Caller will copy storage into the AggContext after the first call and then
+ * should not touch it as we will always return the same pointer passed in. */
+ PG_RETURN_BYTEA_P(a
info);
+ }
+
+ /*-----------------------------------------------------------------------------
+ * aaccum_ffunc :
+ * Final function for the array_accum() aggregate, creates the final
+ * finished array and passes it back to the user. Also deletes the
+ * memory context created by the aaccum_sfunc(). makeArrayResult()
+ * does all the heavy lifting here, this is really just a glue function.
+ *----------------------------------------------------------------------------
+ */
+ Datum
+ aaccum_ffunc(PG_FUNC
TION_ARGS)
+ {
+ aaccum_info *ainfo
;
+
+ /* Check if we are passed in a NULL */
+ if (PG_ARGISNULL(0)) PG_RETURN_ARRAYTYPE_
P(NULL);
+
+ /* Make sure we are being called in an aggregate. */
+ if (!fcinfo->context || !IsA(fcinfo->context, AggState))
+ ereport(ERROR,
+ (errcode(ERRCODE
_FEATURE_NOT_SUPPORT
ED),
+ errmsg("Can not call aaccum_sfunc as a non-aggregate"),
+ errhint("Use the array_accum aggregate")));
+
+ ainfo = (aaccum_info*) PG_GETARG_BYTEA_P(0)
;
+
+ /* makeArrayResult will delete ainfo->arrctx for us. */
+ PG_RETURN_ARRAYTYPE
_P(makeArrayResult(a
info->astate, ainfo->arrctx));
+ }

/*
* used by text_to_array() in varlena.c
Index: src/include/catalog/pg_aggregate.h
====================
====================
====================
=======
RCS file: /projects/cvsroot/pgsql/src/include/catalog/pg_aggregate.h,v
retrieving revision 1.58
diff -c -r1.58 pg_aggregate.h
*** src/include/catalog/pg_aggregate.h 4 Oct 2006 00:30:07 -0000 1.58
--- src/include/catalog/pg_aggregate.h 11 Oct 2006 04:38:46 -0000
***************
*** 221,226 ****
--- 221,229 ----
DATA(insert ( 2242 bitand - 0 1560 _null_ ));
DATA(insert ( 2243 bitor - 0 1560 _null_ ));

+ /* array accumulation */
+ DATA(insert ( 322 aaccum_sfunc aaccum_ffunc 0 227
7 _null_ ));
+
/*
* prototypes for functions in pg_aggregate.c
*/
Index: src/include/catalog/pg_proc.h
====================
====================
====================
=======
RCS file: /projects/cvsroot/pgsql/src/include/catalog/pg_proc.h,v
retrieving revision 1.427
diff -c -r1.427 pg_proc.h
*** src/include/catalog/pg_proc.h 4 Oct 2006 00:30:07 -0000 1.427
--- src/include/catalog/pg_proc.h 11 Oct 2006 04:38:47 -0000
***************
*** 1017,1022 ****
--- 1017,1026 ----
DESCR("larger of two");
DATA(insert OID = 516 ( array_smaller PGNSP PGUID 12 f f t f i 2 2277 "2277 2277" _null_ _null_ _null_ array_smaller - _null_ ));
DESCR("smaller of two");
+ DATA(insert OID = 320 ( aaccum_sfunc PGNSP PGUID 12 f f f f i 2 2277 "2277 2283" _null_ _null_ _null_ aaccum_sfunc - _null_ ));
+ DESCR("array_accum aggregate state function");
+ DATA(insert OID = 321 ( aaccum_ffunc PGNSP PGUID 12 f f f f i 1 2277 "2277" _null_ _null_ _null_ aaccum_ffunc - _null_ ));
+ DESCR("array_accum aggregate final function");

DATA(insert OID = 760 ( smgrin PGNSP PGUID 12 f f t f s 1 210 "2275" _null_ _null_ _null_ smgrin - _null_ ));
DESCR("I/O");
***************
*** 3252,3257 ****
--- 3256,3263 ----
DATA(insert OID = 2828 ( covar_samp PGNSP PGUID 12 t f f f i 2 701 "701 701" _null_ _null_ _null_ aggregate_dummy - _null_ ));
DATA(insert OID = 2829 ( corr PGNSP PGUID 12 t f f f i 2 701 "701 701" _null_ _null_ _null_ aggregate_dummy - _null_ ));

+ DATA(insert OID = 322 ( array_accum PGNSP PGUID 12 t f f f i 1 2277 "2283" _null_ _null_ _null_ aggregate_dummy - _null_ ));
+
DATA(insert OID = 2160 ( text_pattern_lt PGNSP PGUID 12 f f t f i 2 16 "25 25" _null_ _null_ _null_ text_pattern_lt - _null_ ));
DATA(insert OID = 2161 ( text_pattern_le PGNSP PGUID 12 f f t f i 2 16 "25 25" _null_ _null_ _null_ text_pattern_le - _null_ ));
DATA(insert OID = 2162 ( text_pattern_eq PGNSP PGUID 12 f f t f i 2 16 "25 25" _null_ _null_ _null_ text_pattern_eq - _null_ ));
Index: src/include/utils/array.h
====================
====================
====================
=======
RCS file: /projects/cvsroot/pgsql/src/include/utils/array.h,v
retrieving revision 1.59
diff -c -r1.59 array.h
*** src/include/utils/array.h 10 Sep 2006 20:14:20 -0000 1.59
--- src/include/utils/array.h 11 Oct 2006 04:38:47 -0000
***************
*** 266,271 ****
--- 266,274 ----
*/
extern Datum array_push(PG_FUNCTI
ON_ARGS);
extern Datum array_cat(PG_FUNCTIO
N_ARGS);
+ extern Datum aaccum_sfunc(PG_FUNC
TION_ARGS);
+ extern Datum aaccum_ffunc(PG_FUNC
TION_ARGS);
+

extern ArrayType *create_singleton_ar
ray(FunctionCallInfo
fcinfo,
Oid element_type,
Index: src/test/regress/expected/aggregates.out
====================
====================
====================
=======
RCS file: /projects/cvsroot/pgsql/src/test/regress/expected/aggregates.out,v
retrieving revision 1.15
diff -c -r1.15 aggregates.out
*** src/test/regress/expected/aggregates.out 28 Jul 2006 18:33:04 -0000 1.15
--- src/test/regress/expected/aggregates.out 11 Oct 2006 04:38:47 -0000
***************
*** 236,241 ****
--- 236,248 ----
9 | 100 | 4
(10 rows)

+ -- array accumulation aggregate
+ SELECT array_accum(generate
_series) from generate_series(0,5)
;
+ array_accum
+ ---------------
+ {0,1,2,3,4,5}
+ (1 row)
+
-- user-defined aggregates
SELECT newavg(four) AS avg_1 FROM onek;
avg_1
Index: src/test/regress/sql/aggregates.sql
====================
====================
====================
=======
RCS file: /projects/cvsroot/pgsql/src/test/regress/sql/aggregates.sql,v
retrieving revision 1.13
diff -c -r1.13 aggregates.sql
*** src/test/regress/sql/aggregates.sql 28 Jul 2006 18:33:04 -0000 1.13
--- src/test/regress/sql/aggregates.sql 11 Oct 2006 04:38:48 -0000
***************
*** 59,64 ****
--- 59,67 ----
select ten, count(four), sum(DISTINCT four) from onek
group by ten order by ten;

+ -- array accumulation aggregate
+ SELECT array_accum(generate
_series) from generate_series(0,5)
;
+
-- user-defined aggregates
SELECT newavg(four) AS avg_1 FROM onek;
SELECT newsum(four) AS sum_1500 FROM onek;

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Neil Conway

2006-10-25, 8:29 am

On Wed, 2006-10-11 at 00:51 -0400, Stephen Frost wrote:
> An array_accum aggregate has existed in the documentation for quite
> some time using the inefficient (for larger arrays) array_append
> routine.


My vague recollection is that array_accum() is only defined in the
documentation because there was some debate about how it ought to behave
in some corner cases. I can't recall the details, but it was discussed
when array_accum() was originally proposed by Joe. Does anyone remember?

Minor comments on the patch:

> *** 132,138 ****
> </programlisting>
>
> Here, the actual state type for any aggregate call is the array type
> ! having the actual input type as elements.
> </para>
>
> <para>
> --- 132,141 ----
> </programlisting>
>
> Here, the actual state type for any aggregate call is the array type
> ! having the actual input type as elements. Note: array_accum() is now
> ! a built-in aggregate which uses a much more efficient mechanism than
> ! that which is provided by array_append, prior users of array_accum()
> ! may be pleasantly suprised at the marked improvment for larger arrays.
> </para>


Not worth documenting, I think.

> *** 15,20 ****
> --- 15,22 ----
> #include "utils/array.h"
> #include "utils/builtins.h"
> #include "utils/lsyscache.h"
> + #include "utils/memutils.h"
> + #include "nodes/execnodes.h"


Include directives should be sorted roughly alphabetically, with the
exception of postgres.h and system headers.

> + /* Structure, used by aaccum_sfunc and aaccum_ffunc to
> + * implement the array_accum() aggregate, for storing
> + * pointers to the ArrayBuildState for the array we are
> + * building and the MemoryContext in which it is being
> + * built. Note that this structure is
> + * considered an 'anyarray' externally, which is a
> + * variable-length datatype, and therefore
> + * must open with an int32 defining the length. */


Gross.

> + /* Make sure we are being called in an aggregate. */
> + if (!fcinfo->context || !IsA(fcinfo->context, AggState))
> + ereport(ERROR,
> + (errcode(ERRCODE
_FEATURE_NOT_SUPPORT
ED),
> + errmsg("Can not call aaccum_sfunc as a non-aggregate"),
> + errhint("Use the array_accum aggregate")));


Per the error message style guidelines, errmsg() should begin with a
lowercase letter and errhint() should be a complete sentence (so it
needs punctuation, for example).

> + Datum
> + aaccum_ffunc(PG_FUNC
TION_ARGS)
> + {
> + aaccum_info *ainfo
;
> +
> + /* Check if we are passed in a NULL */
> + if (PG_ARGISNULL(0)) PG_RETURN_ARRAYTYPE_
P(NULL);


There is no guarantee why SQL NULL and PG_RETURN_XYZ(NULL) refer to the
same thing -- use PG_RETURN_NULL() to return a SQL NULL value, or just
make the function strict.

-Neil



---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Stephen Frost

2006-10-25, 8:29 am

* Neil Conway (neilc@samurai.com) wrote:
> On Wed, 2006-10-11 at 00:51 -0400, Stephen Frost wrote:
>
> Not worth documenting, I think.


Ok. Either way is fine with me, honestly.

>
> Include directives should be sorted roughly alphabetically, with the
> exception of postgres.h and system headers.


Ah, yeah, you know I noticed that when I added memutils but wasn't
thinking when I went back and added execnodes (I had been trying to
minimize the number of includes by adding ones I needed one at a time
and seeing if any more were necessary).

>
> Gross.


Indeed. :/

>
> Per the error message style guidelines, errmsg() should begin with a
> lowercase letter and errhint() should be a complete sentence (so it
> needs punctuation, for example).


Ah, ok, guess I should go read those.

>
> There is no guarantee why SQL NULL and PG_RETURN_XYZ(NULL) refer to the
> same thing -- use PG_RETURN_NULL() to return a SQL NULL value, or just
> make the function strict.


Huh, alright. I'll probably just change it to PG_RETURN_NULL().

Thanks!

Stephen

Stephen Frost

2006-10-25, 8:29 am

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> (However, now that we support nulls in arrays, meseems a more consistent
> definition would be that it allows null inputs and just includes them in
> the output. So probably you do need it non-strict.)


This was my intention.

> I'm inclined to think that this example demonstrates a deficiency in the
> aggregate-function design: there should be a way to declare what we're
> really doing. But I don't know exactly what that should look like.


I agree and would much rather have a clean solution which works with the
design than one which has to work outside it. When I first was trying
to decide on the state-type I was looking through the PG catalogs for
essentially a "complex C type" which translated to a void*. Perhaps
such a type could be added. Unless that's considered along the lines of
an 'any' type it'd cause problems for the polymorphism aspect.

Another alternative would be to provide a seperate area for each
aggregate to put any other information it needs. This would almost
certainly only be available to C functions but would essentially be a
void* which is provided through the AggState structure but tracked by
the aggregator routines and reset for each aggregate function being
run. If that's acceptable, I don't think it'd be all that difficult to
implement. With that, aaccum_sfunc and aaccum_ffunc would ignore the
state variable passed to them in favor of their custom structure
available through fcinfo->AggState (I expect they'd just keep the
state variable NULL and be marked non-strict, or set it to some constant
if necessary). The pointer would have to be tracked somewhere and then
copied in/out on each call, but that doesn't seem too difficult to me.
After all, the state variable is already being tracked somewhere, this
would just sit next to it, in my head anyway.

I've got some time this weekend and would be happy to take a shot at
the second proposal if that's generally acceptable.

Thanks,

Stephen

Tom Lane

2006-10-25, 8:29 am

Stephen Frost <sfrost@snowman.net> writes:
> Another alternative would be to provide a seperate area for each
> aggregate to put any other information it needs.


I'm not convinced that that's necessary --- the cases we have at hand
suggest that the transition function is perfectly capable of doing the
storage management it wants. The problem is how to declare to CREATE
AGGREGATE that we're using a transition function of this kind rather
than the "stupid" functions it expects. When the function is doing its
own storage management, we'd really rather that nodeAgg.c stayed out of
the way and didn't try to do any datum copying at all; having it copy a
placeholder bytea or anyarray or whatever is really a waste of cycles,
not to mention obscuring what is going on. If nodeAgg just provided a
pass-by-value Datum, which the transition function could use to store a
pointer to storage it's handling, things would be a lot cleaner.

After a little bit of thought I'm tempted to propose that we handle this
by inventing a new pseudotype called something like "aggregate_state",
which'd be declared in the catalogs as pass-by-value, thereby
suppressing useless copying activity in nodeAgg.c. You'd declare the
aggregate as having stype = aggregate_state, and the transition function
would have signature
sfunc(aggregate_stat
e, ... aggregate-input-type(s) ...)
returns aggregate_state
and the final function of course
ffunc(aggregate_stat
e) returns aggregate-result-type

aggregate_state would have no other uses in the system, and its input
and output functions would raise an error, so type safety is assured
--- there would be no way to call either the sfunc or ffunc "manually",
except by passing a NULL value, which should be safe because that's what
they'd expect as the aggregate initial condition.

One advantage of doing it this way is that the planner could be taught
to recognize aggregates with stype = aggregate_state specially, and make
allowance for the fact that they'll use more workspace than meets the
eye. If we don't have something like this then the planner is likely to
try to use hash aggregation in scenarios where it'd be absolutely fatal
to do so. I'm not sure whether we'd want to completely forbid hash
aggregation when any stype = aggregate_state is present, but for sure we
want to assume that there's some pretty large amount of per-aggregate
state we don't know about.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Tom Lane

2006-10-25, 8:29 am

I wrote:
> aggregate_state would have no other uses in the system, and its input
> and output functions would raise an error, so type safety is assured
> --- there would be no way to call either the sfunc or ffunc "manually",
> except by passing a NULL value, which should be safe because that's what
> they'd expect as the aggregate initial condition.


Um, no, I take that back, unless you want to invent a separate
pseudotype for each such aggregate. Otherwise you can crash it with

my_ffunc(your_sfunc(
null, whatever))

because my_ffunc will be expecting a datastructure different from what
it gets.

Maybe having a check for AggState call context is enough of a defense for
that, but I'm not really satisfied. Back to the drawing board ...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql
.org so that your
message can get through to the mailing list cleanly

Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com