|
Home > Archive > MS SQL Data Warehousing > March 2006 > Suggestion for Handling Very Large Tables (SQL 2000)
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Suggestion for Handling Very Large Tables (SQL 2000)
|
|
| Christian Nunciato 2006-02-25, 9:28 am |
| Hi all:
I just learned today that a project I'm working on has a couple of very
large tables that need, somehow, to be optimized for searching. One table
has roughly 7 million rows, and another, its relative, has over 16 million.
At present, client applications only insert into these tables, but our users
have expressed a desire to select from them as well, using a small set of
search criteria, so we need to come up with an appropriate approach. At the
very least, to continue the application as it exists today, it seems to me
the two tables need to be trimmed down -- perhaps have their data archived
into separate tables, or databases, or some other approach -- but if users
also want to begin reporting on this data, allowing them to select against it
could bring the server, which is already overburdened, to its knees.
I'm an application developer and not a DBA, so I'm looking to you all for
some best-practices suggestions, here. I've looked briefly into SQL Analysis
Services, but I'm not sure thatt's the right approach for us, since it
appears Analysis Services is merely an interface for handling data
processing, and our users need real-time data (or as close to real-time as
possible -- big surprise, eh?). I'm wondering if simply segmenting the data
(which is very simply stored -- user ID, thing ID, date/time stamp, and
that's pretty much it) into separate tables might be an acceptable approach,
but that approach also seems a little crude to me. How do DBAs handle very
large t ables like this in a way that's least cumbersome to the client-side
developers and end users?
Thanks hugely in advance,
Chris
| |
|
| So you have very large tables in the operational environment that you
want to partition, but you still need to provide a convenient method for
developers to query against them? This is a common architectural issue, and
the standard answer is called [partitioned views].
http://msdn.microsoft.com/library/d...des_06_17zr.asp
http://www.microsoft.com/sql/prodin...calability.mspx
Also, there are techniques for near real time refreshing of OLAP data,
but whether or not you choose to go this additional step should perhaps
depend on what impact would result from the users querying against the
tables directly from the operational system. If the data is not that rich
(just a log table with few attributes of interest to users), then it may not
be worth the expense and maintenance of setting up a server for Analysis
Services.
http://msdn.microsoft.com/library/d...al-timeolap.asp
http://www.microsoft.com/technet/pr...n/anservog.mspx
"Christian Nunciato" < ChristianNunciato@di
scussions.microsoft.com> wrote in
message news:5BF9AF00-3A74-4044-A203- 2E5CAA48D18D@microso
ft.com...
> Hi all:
>
> I just learned today that a project I'm working on has a couple of very
> large tables that need, somehow, to be optimized for searching. One table
> has roughly 7 million rows, and another, its relative, has over 16
> million.
>
> At present, client applications only insert into these tables, but our
> users
> have expressed a desire to select from them as well, using a small set of
> search criteria, so we need to come up with an appropriate approach. At
> the
> very least, to continue the application as it exists today, it seems to me
> the two tables need to be trimmed down -- perhaps have their data archived
> into separate tables, or databases, or some other approach -- but if users
> also want to begin reporting on this data, allowing them to select against
> it
> could bring the server, which is already overburdened, to its knees.
>
> I'm an application developer and not a DBA, so I'm looking to you all for
> some best-practices suggestions, here. I've looked briefly into SQL
> Analysis
> Services, but I'm not sure thatt's the right approach for us, since it
> appears Analysis Services is merely an interface for handling data
> processing, and our users need real-time data (or as close to real-time as
> possible -- big surprise, eh?). I'm wondering if simply segmenting the
> data
> (which is very simply stored -- user ID, thing ID, date/time stamp, and
> that's pretty much it) into separate tables might be an acceptable
> approach,
> but that approach also seems a little crude to me. How do DBAs handle
> very
> large t ables like this in a way that's least cumbersome to the
> client-side
> developers and end users?
>
> Thanks hugely in advance,
>
> Chris
| |
| Mike S. 2006-02-25, 9:28 am |
| Chris,
JT has some good points and links in his response, and represent best
practices. I'll add to what he has said and give you the flip side - shall we
say, not so best practices, but a way to get by until you can implement them.
By no means are these recommendations for every situation, but they may fit
your needs.
16 million rows is a large set of data but depending on your circumstances,
may be manageable as it exists. Indexes of course, play a major role in
search speed and server workload. BUT, don't add any new indexes on that
large of a table unless you have a test environment to see how long the index
will take to build. Indexes also have the potential to increase your save
times as well as storage requirements substantially.
You may be able to tweak indexes so your save times increase a little but
your searches return data rapidly. Schema optimization is also paramount in
larger tables, but again, don't adjust it without having a lot of available
time and disk space to support the modifications.
If you can control the search select statements, keep the columns returned
to a minimum. If your environment allows potentially dirty reads, check into
using the WITH (NOLOCK) hint in the select statements.
All of this also depends on how fast you expect your data to grow.
Additionally, the introduction of partitioned views or any other large table
management methods will require additional maintenance. You may be able to
get by cheaper in the long run by upgrading your server now.
-Mike
"Christian Nunciato" wrote:
> Hi all:
>
> I just learned today that a project I'm working on has a couple of very
> large tables that need, somehow, to be optimized for searching. One table
> has roughly 7 million rows, and another, its relative, has over 16 million.
>
> At present, client applications only insert into these tables, but our users
> have expressed a desire to select from them as well, using a small set of
> search criteria, so we need to come up with an appropriate approach. At the
> very least, to continue the application as it exists today, it seems to me
> the two tables need to be trimmed down -- perhaps have their data archived
> into separate tables, or databases, or some other approach -- but if users
> also want to begin reporting on this data, allowing them to select against it
> could bring the server, which is already overburdened, to its knees.
>
> I'm an application developer and not a DBA, so I'm looking to you all for
> some best-practices suggestions, here. I've looked briefly into SQL Analysis
> Services, but I'm not sure thatt's the right approach for us, since it
> appears Analysis Services is merely an interface for handling data
> processing, and our users need real-time data (or as close to real-time as
> possible -- big surprise, eh?). I'm wondering if simply segmenting the data
> (which is very simply stored -- user ID, thing ID, date/time stamp, and
> that's pretty much it) into separate tables might be an acceptable approach,
> but that approach also seems a little crude to me. How do DBAs handle very
> large t ables like this in a way that's least cumbersome to the client-side
> developers and end users?
>
> Thanks hugely in advance,
>
> Chris
| |
|
| my first suggestion is:
disk subsystem!
increase the number of disks to increase the throughput. and setup a Raid
0+1.
scanning big tables with or without indexes is (near) allways slowdown by
the disks.
1 good IDE drive can read 60MB/s, if your table contain 5Gb of data... then
reading the table is long.
if you have 4 disks in Raid 0 (stripped) you can raise this value to 60 * 4
= 240MB/s.
its not so simple, because index and file fragmentation reduce this
throughput, but its a starting point.
for example, I have a database with a partitionned table which contains 110
millions of rows.
Doing a query like:
select D.column1, count(*) from table T inner join dimension D on D.key =
T.Key group by D.column1
(the key used is the partitionned key)
takes 8minutes on a dev server (1 dual core opteron, 3 IDE drives in raid 0)
During this process the CPU is used at 25%
The same query is slower (10 minutes) on the production server (4 Xeon, SAN)
because we have a SAN issue.
so you see the impact of the disk subsystem.
also, if you plan to do some group by queries, the tempdb database will be
used.
so if the tempdb database use his own set of disks, you'll improve the
performance.
from a developper point of view, maybe you have some control of the queries
executed against the database. So if you plan to aggregate data, maybe your
application (or your storedprocedures) can update the big table and the
aggregated table at the same time, then your "reports" will execute queries
against the summarized table.
but if you can't control what the user ask for, try using AS, specially
AS2005 which provide a lot of options for real time cubes. A webcast
demonstrate how to update a MOLAP cube based on a large table (more then
100millions of rows) in near real time.
"Christian Nunciato" < ChristianNunciato@di
scussions.microsoft.com> wrote in
message news:5BF9AF00-3A74-4044-A203- 2E5CAA48D18D@microso
ft.com...
> Hi all:
>
> I just learned today that a project I'm working on has a couple of very
> large tables that need, somehow, to be optimized for searching. One table
> has roughly 7 million rows, and another, its relative, has over 16
> million.
>
> At present, client applications only insert into these tables, but our
> users
> have expressed a desire to select from them as well, using a small set of
> search criteria, so we need to come up with an appropriate approach. At
> the
> very least, to continue the application as it exists today, it seems to me
> the two tables need to be trimmed down -- perhaps have their data archived
> into separate tables, or databases, or some other approach -- but if users
> also want to begin reporting on this data, allowing them to select against
> it
> could bring the server, which is already overburdened, to its knees.
>
> I'm an application developer and not a DBA, so I'm looking to you all for
> some best-practices suggestions, here. I've looked briefly into SQL
> Analysis
> Services, but I'm not sure thatt's the right approach for us, since it
> appears Analysis Services is merely an interface for handling data
> processing, and our users need real-time data (or as close to real-time as
> possible -- big surprise, eh?). I'm wondering if simply segmenting the
> data
> (which is very simply stored -- user ID, thing ID, date/time stamp, and
> that's pretty much it) into separate tables might be an acceptable
> approach,
> but that approach also seems a little crude to me. How do DBAs handle
> very
> large t ables like this in a way that's least cumbersome to the
> client-side
> developers and end users?
>
> Thanks hugely in advance,
>
> Chris
| |
| ChrisR 2006-03-05, 8:24 pm |
| I'm a big fan of using continuous SQL replication to a reporting server,
then you can index all you want on that box.
"Christian Nunciato" < ChristianNunciato@di
scussions.microsoft.com> wrote in
message news:5BF9AF00-3A74-4044-A203- 2E5CAA48D18D@microso
ft.com...
> Hi all:
>
> I just learned today that a project I'm working on has a couple of very
> large tables that need, somehow, to be optimized for searching. One table
> has roughly 7 million rows, and another, its relative, has over 16
million.
>
> At present, client applications only insert into these tables, but our
users
> have expressed a desire to select from them as well, using a small set of
> search criteria, so we need to come up with an appropriate approach. At
the
> very least, to continue the application as it exists today, it seems to me
> the two tables need to be trimmed down -- perhaps have their data archived
> into separate tables, or databases, or some other approach -- but if users
> also want to begin reporting on this data, allowing them to select against
it
> could bring the server, which is already overburdened, to its knees.
>
> I'm an application developer and not a DBA, so I'm looking to you all for
> some best-practices suggestions, here. I've looked briefly into SQL
Analysis
> Services, but I'm not sure thatt's the right approach for us, since it
> appears Analysis Services is merely an interface for handling data
> processing, and our users need real-time data (or as close to real-time as
> possible -- big surprise, eh?). I'm wondering if simply segmenting the
data
> (which is very simply stored -- user ID, thing ID, date/time stamp, and
> that's pretty much it) into separate tables might be an acceptable
approach,
> but that approach also seems a little crude to me. How do DBAs handle
very
> large t ables like this in a way that's least cumbersome to the
client-side
> developers and end users?
>
> Thanks hugely in advance,
>
> Chris
| |
| Stephan Eggermont 2006-03-05, 8:24 pm |
| "J?j?" < willgart@bbbhotmaila
aa.com> wrote:
> my first suggestion is:
> disk subsystem!
>
If you need a raid array, you have too little ram.
If it doesn't fit in one machine, split it.
16 Million rows is not so large that it would
not fit. 16 GByte of ram is below EUR 3000.
Stephan
| |
|
| Are you suggesting that he pin the entire table to RAM?
"Stephan Eggermont" <stephan@stack.nl> wrote in message
news:44057cbc$0$1106
9$e4fe514c@news.xs4all.nl...
> "J?j?" < willgart@bbbhotmaila
aa.com> wrote:
>
> If you need a raid array, you have too little ram.
> If it doesn't fit in one machine, split it.
> 16 Million rows is not so large that it would
> not fit. 16 GByte of ram is below EUR 3000.
>
> Stephan
| |
| Peter Nolan 2006-03-05, 8:24 pm |
| Hi Christian,
6 million and 17 million rows is a rather trivial amount of data
nowadays.....
Buy the hardware to support what you need to do....it will cost less
than spending time trying to figure out anything else....really, it
will.
Peter
|
|
|
|
|