Home > Archive > PostgreSQL Performance > August 2005 > Query plan looks OK, but slow I/O - settings advice?









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Query plan looks OK, but slow I/O - settings advice?
Roger Hand

2005-08-19, 3:26 am

Summary
=======
We are writing to the db pretty much 24 hours a day.
Recently the amount of data we write has increased, and the query speed, formerly okay, has taken a dive.
The query is using the indexes as expected, so I don't _think_ I have a query tuning issue, just an io problem.
The first time a query is done it takes about 60 seconds. The second time it runs in about 6 seconds.
What I know I need advice on is io settings and various buffer settings.
I may also need advice on other things, but just don't know it yet!

Below is ...
- an explain analyze
- details of the db setup and hardware
- some vmstat and iostat output showing the disks are very busy
- the SHOW ALL output for the db config.

Details
=======
Postgres 8.0.3

Below is a sample query. (This is actually implemented as a prepared statement. Here I fill in the '?'s with actual values.)

electric=# EXPLAIN ANALYZE
electric-# SELECT datavalue, logfielddatatype, timestamp FROM logdata_recent
electric-# WHERE (logfielddatatype = 70 OR logfielddatatype = 71 OR logfielddatatype = 69)
electric-# AND graphtargetlog = 1327
electric-# AND timestamp >= 1123052400 AND timestamp <= 1123138800
electric-# ORDER BY timestamp;
QUERY PLAN

--------------------------------------------------
Sort (cost=82.48..82.50 rows=6 width=14) (actual time=60208.968..60211.232 rows=2625 loops=1)
Sort Key: public.logdata_recent."timestamp"
-> Result (cost=0.00..82.41 rows=6 width=14) (actual time=52.483..60200.868 rows=2625 loops=1)
-> Append (cost=0.00..82.41 rows=6 width=14) (actual time=52.476..60189.929 rows=2625 loops=1)
-> Seq Scan on logdata_recent (cost=0.00..46.25 rows=1 width=14) (actual time=0.003..0.003 rows=0 loops=1)
Filter: (((logfielddatatype = 70) OR (logfielddatatype = 71) OR (logfielddatatype = 69)) AND (graphtargetlog = 1327) AND ("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800))
-> Index Scan using logdata_recent_11230
85306_ix_t_fld_gtl, logdata_recent_11230
85306_ix_t_fld_gtl, logdata_recent_11230
85306_ix_t_fld_gtl on logdata_recent_stale
logdata_recent (cost=0.00..18.08 rows=3 width=14) (actual time=52.465..6018
1.624 rows=2625 loops=1)
Index Cond: ((("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 70) AND (graphtargetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 71) AND (grapht
argetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 69) AND (graphtargetlog = 1327)))
Filter: (((logfielddatatype = 70) OR (logfielddatatype = 71) OR (logfielddatatype = 69)) AND (graphtargetlog = 1327) AND ("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800))
-> Index Scan using logdata_recent_11231
39634_ix_t_fld_gtl, logdata_recent_11231
39634_ix_t_fld_gtl, logdata_recent_11231
39634_ix_t_fld_gtl on logdata_recent_activ
e logdata_recent (cost=0.00..18.08 rows=2 width=14) (actual time=0.178..0.17
8 rows=0 loops=1)
Index Cond: ((("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 70) AND (graphtargetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 71) AND (grapht
argetlog = 1327)) OR (("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800) AND (logfielddatatype = 69) AND (graphtargetlog = 1327)))
Filter: (((logfielddatatype = 70) OR (logfielddatatype = 71) OR (logfielddatatype = 69)) AND (graphtargetlog = 1327) AND ("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800))
Total runtime: 60214.545 ms
(13 rows)

60 seconds is much longer than it used to be. I would guess it used to be under 10 seconds. The second time the above query is run we see the magic of caching as the time goes down to 6 seconds.

logdata_recent_activ
e and logdata_recent_stale
are inherited tables of logdata_recent, which never has any data. (This is pseudo-partitioning in action!)
So the very quick seq_scan on the empty logdata_recent parent table is okay with me.

The index is built on timestamp, logfielddatatype, graphtargetlog. I am curious as to why the same index shows up 3 times in the "using" clause, but can live without knowing the details as long as it doesn't indicate that something's wrong.

The logdata_recent_stale
table has 5 millions rows. The size of the table itself, on disk, is 324MB. The size of the index is 210MB.

The disks are ext3 with journalling type of ordered, but this was later changed to writeback with no apparent change in speed.

They're on a Dell poweredge 6650 with LSI raid card, setup as follows:
4 disks raid 10 for indexes (145GB) - sdc1
6 disks raid 10 for data (220GB) - sdd1
2 mirrored disks for logs - sdb1

stripe size is 32k
cache policy: cached io (am told the controller has bbu)
write policy: write-back
read policy: readahead

The partition names do what they say ...
[root@rage-db2 /dbdata01]$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb1 70430588 729324 66123592 2% /dblog01
/dev/sdc1 140861236 19472588 114233300 15% /dbindex01
/dev/sdd1 211299960 157159988 43406548 79% /dbdata01
....

Using iostat (the version from http://linux.inet.hr/) I saw at one point that the data disk was 100% busy.
I believe this was when running the above query, or similar, but in any case the system is always busy with both reads and (usually) writes.

device mgr/s mgw/s r/s w/s kr/s kw/s size queue wait svc_t %b
sdb1 0 61 0.1 15.5 0.4 305.7 19.6 0.1 5.8 4.9 8
sdc1 21 22 20.6 17.7 164.6 158.6 8.4 1.1 28.3 6.2 24
sdd1 1742 11 1904.7 6.6 14585.6 71.5 7.7 20.6 10.8 0.5 100

Another time, when I was running the query above, the index partition went to 90+% busy for 40 seconds:

device mgr/s mgw/s r/s w/s kr/s kw/s size queue wait svc_t %b
sdb1 0 0 0.0 0.2 0.0 0.8 4.0 0.0 20.0 15.0 0
sdc1 366 53 687.0 66.1 4213.1 483.0 6.2 11.8 15.7 1.3 96
sdd1 8 17 16.6 13.9 99.5 125.4 7.4 0.7 23.0 1.9 6

On another occasion (when the query took 24 seconds) I ran vmstat and iostat every 5 seconds
from just before the query until just after. About the first two outputs are before the query.
In this case the index disk is maxed.

[root@rage-db2 ~]$ vmstat 5 16
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy wa id
0 0 92 1233500 225692 9578564 0 0 0 0 1 1 2 2 1 1
0 1 92 1218460 225748 9595136 0 0 3322 18 655 898 0 0 20 79
0 1 92 1202124 225780 9616140 0 0 4204 58 920 1291 0 1 24 76
0 1 92 1172876 225820 9645348 0 0 5847 120 1053 1482 0 1 23 76
1 0 92 1151712 225836 9666504 0 0 4234 7 847 1239 2 1 18 78
1 0 92 1140860 225844 9677436 0 0 2153 500 575 2027 13 2 11 73
1 0 92 1140852 225848 9677636 0 0 0 506 213 442 10 1 0 89

[root@rage-db2 ~]$ /usr/local/bin/iostat -Px 5 16

device mgr/s mgw/s r/s w/s kr/s kw/s size queue wait svc_t %b
sdb1 1 243 0.1 105.5 2.7 37.6 0.4 0.0 0.2 0.1 1
sdc1 6 111 3.7 75.9 38.3 769.3 10.1 0.0 0.3 0.2 1
sdd1 255 107 85.2 37.6 4.9 581.0 4.8 0.0 0.1 0.0 1

device mgr/s mgw/s r/s w/s kr/s kw/s size queue wait svc_t %b
sdb1 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
sdc1 273 0 414.0 0.4 2747.7 2.4 6.6 1.6 3.9 1.7 69
sdd1 0 1 1.4 0.4 7.2 5.6 7.1 0.0 10.0 6.7 1

device mgr/s mgw/s r/s w/s kr/s kw/s size queue wait svc_t %b
sdb1 0 0 0.0 0.4 0.0 1.6 4.0 0.0 10.0 5.0 0
sdc1 225 4 777.1 4.6 4011.0 35.1 5.2 2.5 3.2 1.3 99
sdd1 0 2 0.0 2.6 0.0 16.8 6.5 0.0 8.5 0.8 0

device mgr/s mgw/s r/s w/s kr/s kw/s size queue wait svc_t %b
sdb1 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
sdc1 508 7 917.8 7.4 5703.0 58.3 6.2 2.2 2.4 1.1 98
sdd1 0 4 0.0 6.8 0.0 44.7 6.6 0.1 15.6 0.6 0

device mgr/s mgw/s r/s w/s kr/s kw/s size queue wait svc_t %b
sdb1 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
sdc1 361 0 737.5 0.4 4391.7 2.4 6.0 1.8 2.4 1.0 76
sdd1 0 0 0.0 0.4 0.0 2.4 6.0 0.0 0.0 0.0 0

device mgr/s mgw/s r/s w/s kr/s kw/s size queue wait svc_t %b
sdb1 0 87 0.0 17.8 0.0 418.3 23.6 0.0 1.3 1.2 2
sdc1 216 2 489.5 0.4 2821.7 11.2 5.8 1.2 2.4 1.1 56
sdd1 2 4 7.2 0.6 37.5 18.4 7.2 0.0 6.2 3.3 3

device mgr/s mgw/s r/s w/s kr/s kw/s size queue wait svc_t %b
sdb1 0 89 0.0 22.4 0.0 446.3 20.0 0.0 1.1 0.8 2
sdc1 0 4 0.0 1.0 0.0 18.4 18.4 0.0 0.0 0.0 0
sdd1 0 6 0.0 1.0 0.0 27.1 27.2 0.0 0.0 0.0 0

device mgr/s mgw/s r/s w/s kr/s kw/s size queue wait svc_t %b
sdb1 0 89 0.0 22.5 0.0 446.2 19.8 0.0 0.4 0.3 1
sdc1 0 2 0.0 0.4 0.0 9.6 24.0 0.0 0.0 0.0 0
sdd1 0 4 0.0 0.6 0.0 20.0 33.3 0.0 0.0 0.0 0


Finally, here's a show all:

add_missing_from | on
archive_command | unset
australian_timezones
| off
authentication_timeo
ut | 60
bgwriter_delay | 200
bgwriter_maxpages | 100
bgwriter_percent | 1
block_size | 8192
check_function_bodie
s | on
checkpoint_segments | 20
checkpoint_timeout | 300
checkpoint_warning | 30
client_encoding | UNICODE
client_min_messages | notice
commit_delay | 350
commit_siblings | 5
config_file | /dbdata01/pgdata/postgresql.conf
cpu_index_tuple_cost
| 0.001
cpu_operator_cost | 0.0025
cpu_tuple_cost | 0.01
custom_variable_clas
ses | unset
data_directory | /dbdata01/pgdata
DateStyle | ISO, MDY
db_user_namespace | off
deadlock_timeout | 1000
debug_pretty_print | off
debug_print_parse | off
debug_print_plan | off
debug_print_rewritte
n | off
debug_shared_buffers
| 0
default_statistics_t
arget | 50
default_tablespace | unset
default_transaction_
isolation | read committed
default_transaction_
read_only | off
default_with_oids | off
dynamic_library_path
| $libdir
effective_cache_size
| 48000
enable_hashagg | on
enable_hashjoin | on
enable_indexscan | on
enable_mergejoin | on
enable_nestloop | on
enable_seqscan | on
enable_sort | on
enable_tidscan | on
explain_pretty_print
| on
external_pid_file | unset
extra_float_digits | 0
from_collapse_limit | 8
fsync | on
geqo | on
geqo_effort | 5
geqo_generations | 0
geqo_pool_size | 0
geqo_selection_bias | 2
geqo_threshold | 12
hba_file | /dbdata01/pgdata/pg_hba.conf
ident_file | /dbdata01/pgdata/pg_ident.conf
integer_datetimes | off
join_collapse_limit | 8
krb_server_keyfile | unset
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
lc_messages | en_US.UTF-8
lc_monetary | en_US.UTF-8
lc_numeric | en_US.UTF-8
lc_time | en_US.UTF-8
listen_addresses | *
log_connections | off
log_destination | stderr
log_directory | /dblog01
log_disconnections | off
log_duration | off
log_error_verbosity | default
log_executor_stats | off
log_filename | postgresql-%Y-%m-%d_%H%M%S.log
log_hostname | off
log_line_prefix | unset
log_min_duration_sta
tement | -1
log_min_error_statem
ent | panic
log_min_messages | notice
log_parser_stats | off
log_planner_stats | off
log_rotation_age | 1440
log_rotation_size | 10240
log_statement | none
log_statement_stats | off
log_truncate_on_rota
tion | off
maintenance_work_mem
| 262144
max_connections | 40
max_files_per_proces
s | 1000
max_fsm_pages | 100000
max_fsm_relations | 1000
max_function_args | 32
max_identifier_lengt
h | 63
max_index_keys | 32
max_locks_per_transa
ction | 64
max_stack_depth | 2048
password_encryption | on
port | 5432
pre_auth_delay | 0
preload_libraries | unset
random_page_cost | 4
redirect_stderr | on
regex_flavor | advanced
rendezvous_name | unset
search_path | $user,public
server_encoding | UNICODE
server_version | 8.0.3
shared_buffers | 10240
silent_mode | off
sql_inheritance | on
ssl | off
statement_timeout | 0
stats_block_level | off
stats_command_string
| off
stats_reset_on_serve
r_start | on
stats_row_level | on
stats_start_collecto
r | on
superuser_reserved_c
onnections | 2
syslog_facility | LOCAL0
syslog_ident | postgres
TimeZone | PST8PDT
trace_notify | off
transaction_isolatio
n | read committed
transaction_read_onl
y | off
transform_null_equal
s | off
unix_socket_director
y | unset
unix_socket_group | unset
unix_socket_permissi
ons | 511
vacuum_cost_delay | 180
vacuum_cost_limit | 200
vacuum_cost_page_dir
ty | 20
vacuum_cost_page_hit
| 1
vacuum_cost_page_mis
s | 10
wal_buffers | 8
wal_sync_method | fdatasync
work_mem | 98304
zero_damaged_pages | off

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Jeffrey W. Baker

2005-08-19, 3:26 am

On Tue, 2005-08-16 at 10:46 -0700, Roger Hand wrote:
> The disks are ext3 with journalling type of ordered, but this was later changed to writeback with no apparent change in speed.
>
> They're on a Dell poweredge 6650 with LSI raid card, setup as follows:
> 4 disks raid 10 for indexes (145GB) - sdc1
> 6 disks raid 10 for data (220GB) - sdd1
> 2 mirrored disks for logs - sdb1
>
> stripe size is 32k
> cache policy: cached io (am told the controller has bbu)
> write policy: write-back
> read policy: readahead


I assume you are using Linux 2.6. Have you considered booting your
machine with elevator=deadline? You can also change this at runtime
using sysfs.

These read speeds are not too impressive. Perhaps this is a slow
controller. Alternately you might need bigger CPUs.

There's a lot of possibilities, obviously :) I'd start with the
elevator, since that's easily tested.

-jwb


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Gavin Sherry

2005-08-19, 3:26 am

The query plan does *not* look okay.

> electric=# EXPLAIN ANALYZE
> electric-# SELECT datavalue, logfielddatatype, timestamp FROM logdata_recent
> electric-# WHERE (logfielddatatype = 70 OR logfielddatatype = 71 OR logfielddatatype = 69)
> electric-# AND graphtargetlog = 1327
> electric-# AND timestamp >= 1123052400 AND timestamp <= 1123138800
> electric-# ORDER BY timestamp;
> QUERY PLAN
> --------------------------------------------------
> Sort (cost=82.48..82.50 rows=6 width=14) (actual time=60208.968..60211.232 rows=2625 loops=1)
> Sort Key: public.logdata_recent."timestamp"
> -> Result (cost=0.00..82.41 rows=6 width=14) (actual time=52.483..60200.868 rows=2625 loops=1)
> -> Append (cost=0.00..82.41 rows=6 width=14) (actual time=52.476..60189.929 rows=2625 loops=1)
> -> Seq Scan on logdata_recent (cost=0.00..46.25 rows=1 width=14) (actual time=0.003..0.003 rows=0 loops=1)
> Filter: (((logfielddatatype = 70) OR (logfielddatatype = 71) OR (logfielddatatype = 69)) AND (graphtargetlog = 1327) AND ("timestamp" >= 1123052400) AND ("timestamp" <= 1123138800))
> -> Index Scan using logdata_recent_11230
85306_ix_t_fld_gtl, logdata_recent_11230
85306_ix_t_fld_gtl, logdata_recent_11230
85306_ix_t_fld_gtl on logdata_recent_stale
logdata_recent (cost=0.00..18.08 rows=3 width=14) (actual time=52.465..60

181.624 rows=2625 loops=1)

Notice here that expected rows is 3, but actual rows is a hell of a lot
higher. Try increasing stats collections for the columns on which
logdata_recent_11230
85306_ix_t_fld_gtl is declared.

Also, the actual index scan is taking a long time. How recently have you
vacuum full'd?

Thanks,

Gavin

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Roger Hand

2005-08-19, 3:26 am

Jeffrey W. Baker wrote:
> On Tue, 2005-08-16 at 10:46 -0700, Roger Hand wrote:
>
> I assume you are using Linux 2.6.


Oops, sorry I left that out. Nope, we're on 2.4:

[root@rage-db2 ~]$ uname -a
Linux xxx.xxx.xxx 2.4.21-27.0.2.ELsmp #1 SMP Wed Jan 12 23:35:44 EST 2005 i686 i686 i386 GNU/Linux

It's RedHat Enterprise AS3.0 Fri Nov 5 17:55:14 PST 2004

> Have you considered booting your
> machine with elevator=deadline?


I just did a little Googling and see that the 2.4 kernel didn't have a decent elevator tuning system, and that was fixed in 2.6. Hmmm ....

Thanks for the ideas ...

-Roger

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Dan Harris

2005-08-20, 3:24 am


On Aug 19, 2005, at 12:55 AM, Jeffrey W. Baker wrote:

> On Tue, 2005-08-16 at 10:46 -0700, Roger Hand wrote:
> Have you considered booting your
> machine with elevator=deadline?


Although I'm not the OP for this problem, I thought I'd try it out.
WOW.. this should be in a Pg tuning guide somewhere. I added this to
my server tonight just for kicks and saw a pronounced improvement in
IO performance. Thank you very much for mentioning this on the list.

I didn't have a long enough maintenance window to do solid
benchmarking, but I can say for certain that the change was
noticeable, especially in VACUUM operations.

Specs for the server:

PG 8.0.1
Linux 2.6.12-3 kernel
4xOpteron 2.2
12GB RAM
16-drive RAID 10
XFS mounted with noatime
pg_xlog on separate RAID controller

-Dan


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql
.org so that your
message can get through to the mailing list cleanly

Michael Stone

2005-08-20, 7:23 am

On Sat, Aug 20, 2005 at 12:52:08AM -0600, Dan Harris wrote:
>On Aug 19, 2005, at 12:55 AM, Jeffrey W. Baker wrote:
>
>Although I'm not the OP for this problem, I thought I'd try it out.
>WOW.. this should be in a Pg tuning guide somewhere.

& #91;snip]
>16-drive RAID 10


Yeah, the default scheduler tries to optimize disk access patterns for a
single-spindle setup, and actually makes things worse if you have a
device with multiple spindles.

Mike Stone

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Sponsored Links





Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive | Programming forum archive

Copyright 2008 droptable.com