Home > Archive > MS SQL Server > October 2006 > User Defined Aggregate Function For Finding Median









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author User Defined Aggregate Function For Finding Median
BDB

2006-10-24, 6:30 pm

Anybody have any luck with this one?
Anith Sen

2006-10-24, 6:30 pm

Please search the archives of this group. You should be able to find plenty.

--
Anith


BDB

2006-10-24, 6:30 pm


"Anith Sen" < anith@bizdatasolutio
ns.com> wrote in message
news:OkS4uTm3GHA.1268@TK2MSFTNGP02.phx.gbl...
> Please search the archives of this group. You should be able to find
> plenty.


Well the difficulty occurs when I cannot persist a actually set of values
given the limitations of implementing user-defined aggregate functions.

I haven't found anything in the archive of this group that addresses this.

Thanks.

Arnie Rowland

2006-10-24, 6:30 pm

Good Luck. There are many alternative attempts. Here is one of the better to
work with options.

http://databases.aspfaq.com/databas...in-a-table.html

YOu may also wish to google for CELKO and MEDIAN.


--
Arnie Rowland, Ph.D.
Westwood Consulting, Inc

Most good judgment comes from experience.
Most experience comes from bad judgment.
- Anonymous


"BDB" <bdb@reply.to.group.com> wrote in message
news:e9mjzLm3GHA.2152@TK2MSFTNGP06.phx.gbl...
> Anybody have any luck with this one?



--CELKO--

2006-10-24, 6:30 pm

Median with Characteristic Function

Anatoly Abramovich, Yelena Alexandrova, and Eugene Birger presented a
series of articles in SQL Forum magazine on computing the median (SQL
Forum 1993, 1994). They define a characteristic function, which they
call delta, using the Sybase sign() function. The delta or
characteristic function accepts a Boolean expression as an argument and
returns a 1 if it is TRUE and a zero if it is FALSE or UNKNOWN.

In SQL-92 we have a CASE expression, which can be used to construct the
delta function. This is new to SQL-92, but you can find vendor
functions of the form IF...THEN...ELSE that behave like the condition
expression in Algol or like the question markPcolon operator in C.

The authors also distinguish between the statistical median, whose
value must be a member of the set, and the financial median, whose
value is the average of the middle two members of the set. A
statistical median exists when there is an odd number of items in the
set. If there is an even number of items, you must decide if you want
to use the highest value in the lower half (they call this the left
median) or the lowest value in the upper half (they call this the right
median).

The left statistical median of a unique column can be found with this
query:

SELECT P1.bin
FROM Parts AS P1, Parts AS P2
GROUP BY P1.bin
HAVING SUM(CASE WHEN (P2.bin <= P1.bin) THEN 1 ELSE 0 END)
= (COUNT(*) + 1) / 2;

Changing the direction of the theta test in the HAVING clause will
allow you to pick the right statistical median if a central element
does not exist in the set. You will also notice something else about
the median of a set of unique values: It is usually meaningless. What
does the median bin number mean, anyway? A good rule of thumb is that
if it does not make sense as an average, it does not make sense as a
median.

The statistical median of a column with duplicate values can be found
with a query based on the same ideas, but you have to adjust the HAVING
clause to allow for overlap; thus, the left statistical median is found
by

SELECT P1.weight
FROM Parts AS P1, Parts AS P2
GROUP BY P1.weight
HAVING SUM(CASE WHEN P2.weight <= P1.weight
THEN 1 ELSE 0 END)
>= ((COUNT(*) + 1) / 2)

AND SUM(CASE WHEN P2.weight >= P1.weight
THEN 1 ELSE 0 END)
>= (COUNT(*)/2 + 1);


Notice that here the left and right medians can be the same, so there
is no need to pick one over the other in many of the situations where
you have an even number of items. Switching the comparison operators in
the two CASE expressions will give you the right statistical median.

The author's query for the financial median depends on some Sybase
features that cannot be found in other products, so I would recommend
using a combination of the right and left statistical medians to return
a set of values about the center of the data, and then averaging them,
thus:

SELECT AVG(P1.weight)
FROM Parts AS P1, Parts AS P2
HAVING (SUM(CASE WHEN P2.weight <= P1.weight -- left median
THEN 1 ELSE 0 END)
>= ((COUNT(*) + 1) / 2)

AND SUM(CASE WHEN P2.weight >= P1.weight
THEN 1 ELSE 0 END)
>= (COUNT(*)/2 + 1))

OR (SUM(CASE WHEN P2.weight >= P1.weight -- right median
THEN 1 ELSE 0 END)
>= ((COUNT(*) + 1) / 2)

AND SUM(CASE WHEN P2.weight <= P1.weight
THEN 1 ELSE 0 END)
>= (COUNT(*)/2 + 1));


An optimizer may be able to reduce this expression internally, since
the expressions involved with COUNT(*) are constants. This entire query
could be put into a FROM clause and the average taken of the one or two
rows in the result to find the financial median. In SQL-89, you would
have to define this as a VIEW and then take the average.

Another version using some SQL-99 extensions:

SELECT AVG(x),
ROW_NUMBER () OVER (ORDER BY x ASC) AS hi,
ROW_NUMBER () OVER (ORDER BY x DESC) AS lo
FROM Foobar
WHERE hi IN (lo, lo-1, lo+1);

Kent Tegels

2006-10-24, 6:30 pm

Hello --CELKO--,

> Another version using some SQL-99 extensions:
> SELECT AVG(x),
> ROW_NUMBER () OVER (ORDER BY x ASC) AS hi,
> ROW_NUMBER () OVER (ORDER BY x DESC) AS lo
> FROM Foobar
> WHERE hi IN (lo, lo-1, lo+1);


Which is brillant but doesn't work in SQL Server 2005 due to their *interesting*
implementation of ranking functions. From my blog, here's a version that
does work for us:

with f as (
select X,
row_number() over (order by X asc) as hi,
row_number() over (order by X desc) as lo
from dbo.foobar)
select X as median from f
where hi in (lo,lo+1,lo-1)
go

Cheers,
Kent Tegels
DevelopMentor


BDB

2006-10-24, 6:30 pm


"BDB" <bdb@reply.to.group.com> wrote in message
news:e9mjzLm3GHA.2152@TK2MSFTNGP06.phx.gbl...
> Anybody have any luck with this one?


Thanks guys but the problem is not finding median in SQL. The problem is
implementing a user define aggregate function. The problem is over coming
the inherit problems of persisting a intermittent list in heap memory incase
the aggregation has to happen in parallel. The UDA framework appears to
only allow 8000 bytes for an accumulator.

markc600@hotmail.com

2006-10-24, 6:31 pm


Using the query on the data below gives the answer 3, I think it should
be 4.

create table foobar(X int)
insert into foobar(X) values(1)
insert into foobar(X) values(3)
insert into foobar(X) values(5)
insert into foobar(X) values(5)


Here's another alternative

with f as (
select X,
row_number () over (order by X asc) as rn,
count(*) over() as cn
from dbo.foobar)
select avg(X) as median
from f
where rn between cn-rn and cn-rn+2

Alex Kuznetsov

2006-10-24, 6:31 pm

Kent,

I think Joe's version would not work neither on Oracle nor on DB2, and
their implementation makes perfect sense to me:
they evaluate WHERE clause first, then compute OLAP functions. Your
version would work on all Big 3 databases.

-----------------------
Alex Kuznetsov
http://sqlserver-tips.blogspot.com/
http://sqlserver-puzzles.blogspot.com/

Kent Tegels

2006-10-24, 6:31 pm

> Thanks guys but the problem is not finding median in SQL. The problem
> is implementing a user define aggregate function. The problem is over
> coming the inherit problems of persisting a intermittent list in heap
> memory incase the aggregation has to happen in parallel. The UDA
> framework appears to only allow 8000 bytes for an accumulator.


That's correct and I agree. Something I've wondered about is using binary
serialization of an object graph and System.IO.Compression to "fit the size
12 girl in the size 6 dress." It would probably work to some degree but performance
would be horrible.

Cheers,
Kent


Dieter Noeth

2006-10-24, 6:31 pm

Kent Tegels wrote:

>
> Which is brillant but doesn't work in SQL Server 2005 due to their
> *interesting* implementation of ranking functions.


Why *interesting*?
According to Standard SQL OLAP-functions are processed after WHERE/GROUP
BY/HAVING, so you must use a Derived Table to filter on RANK.

Dieter
BDB

2006-10-24, 6:31 pm


"Kent Tegels" <ktegels@develop.com> wrote in message
news:f2b9912a52988c8
ac6d521726d0@news.microsoft.com...
>
> That's correct and I agree. Something I've wondered about is using binary
> serialization of an object graph and System.IO.Compression to "fit the
> size 12 girl in the size 6 dress." It would probably work to some degree
> but performance would be horrible.
>
> Cheers,
> Kent


LOL.....
My co-workers and I had a good laugh at your post! Thanks.

--CELKO--

2006-10-24, 6:31 pm

>> Which is brillant but doesn't work in SQL Server 2005 due to their *interesting* implementation of ranking functions. From my blog, here's a version that does work for us:<<

.... and anyone else! I like it and I will steal it :) Imitation is
the sincerest form of flattery. Plagiarism is the sincerest form of
imitation.

Kent Tegels

2006-10-24, 6:31 pm

Hello --CELKO--,

> ... and anyone else! I like it and I will steal it :) Imitation is
> the sincerest form of flattery. Plagiarism is the sincerest form of
> imitation.


I'll remember that when I lift examples from your OLAP book for mine then. :)

Cheers,
kt


Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com