|
Home > Archive > MS SQL XML > January 2006 > Insert large amounts of data
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Insert large amounts of data
|
|
| Ryan Pedersen 2006-01-10, 8:24 pm |
| I need your help.
I have stored procedure that takes in an xml document and then inserts
data from that document into a single table. This stored procedure
needs to execute over 200,000 times a day.
Here is a snap shot of what the data looks like... Real data has about
100 instances of Parent Element per xml document and about 2 - 3 Child
Elements per Parent Element. There is also usually one Parent element
that will have about 50 - 80 Child Elements.
<RootElement>
<ParentElements>
<ParentElement>
<DataForParent>abc</DataForParent>
<ChildElement>
<ChildDataOne>first</ChildDataOne>
<ChildDataTwo>asdf</ChildDataTwo>
</ChildElement>
<ChildElement>
<ChildDataOne>two</ChildDataOne>
<ChildDataTwo>fdsafasdf</ChildDataTwo>
</ChildElement>
</ParentElement>
<ParentElement>
<DataForParent>xyz</DataForParent>
<ChildElement>
<ChildDataOne>three</ChildDataOne>
<ChildDataTwo>lkj lkj</ChildDataTwo>
</ChildElement>
<ChildElement>
<ChildDataOne>four</ChildDataOne>
<ChildDataTwo>Data Here</ChildDataTwo>
</ChildElement>
</ParentElement>
</ParentElements>
</RootElement>
So I have tried a few things with different results. Using the
execution plan to measure performance it looks like option C is the
fastest with an estimated operator cost of 2.645.
Option A:
First I used xquery to transform the data into a table.
Estimated Operator Cost = 330.365
select
cast(p.Row.query('../DataForParent/text()') as varchar(100)),
cast(p.Row.query(ChildDataOne/text()') as varchar(2083)),
cast(p.Row.query(ChildDataTwo/text()') as varchar(1000))
from
@xmlDoc.nodes('(/RootElement/ParentElements/ParentElement/ChildElement)')
AS p(Row)
Option B:
Option A performed so slowly that I went back to the drawing board and
tried to do the same thing using OPENXML.
Estimated Operator Cost = 17.245
DECLARE @iy int
exec sp_xml_preparedocume
nt @iy OUTPUT, @xmlDoc
SELECT
*
FROM
OPENXML(@iy, '/RootElement/ParentElements/ParentElement/ChildElement,
2)
WITH (parentdata varchar(100) '../DataForParent/text()',
dataone varchar(2083) ChildDataOne/text()',
datatwo varchar(1000) ChildDataTwo/text()')
Option C:
Finally I wrote my own User Defined Table Value Function to parse the
xml and return a table and believe it or not it is the fastest.
Estimated Operator Cost = 2.745
So what can I do to better the performance of this thing? I haven't
tried bulk loading the xml because I actually need to work on the data
before I just slam it into the database but I could engineer a process
if bulk loading the xml would drive the estimated operation cost down
into the 0.01 or 0.001 range.
Any thoughts or comments that you have are very welcome. Good, bad and
the ugly... bring it on.
Thanks
| |
| Kent Tegels 2006-01-11, 8:24 pm |
| Hello Ryan,
> I need your help.
> I have stored procedure that takes in an xml document and then inserts
> data from that document into a single table. This stored procedure
> needs to execute over 200,000 times a day.
You might want to take a look at using SIS instead, but frankly, I think
your C option is probably the best. You could partially address "A" by inserting
the whole document into single column into a scratch table that has a primary
xml index. This should improve net performance since the node table gets
build just once.
That said, I doubt it will be faster than your TVF.
Thank you,
Kent Tegels
DevelopMentor
http://staff.develop.com/ktegels/
| |
| Ryan Pedersen 2006-01-12, 8:24 pm |
| On Wed, 11 Jan 2006 14:38:58 -0800, Kent Tegels <ktegels@develop.com>
wrote:
>Hello Ryan,
>
>
>You might want to take a look at using SIS instead, but frankly, I think
>your C option is probably the best. You could partially address "A" by inserting
>the whole document into single column into a scratch table that has a primary
>xml index. This should improve net performance since the node table gets
>build just once.
>
>That said, I doubt it will be faster than your TVF.
>
>Thank you,
>Kent Tegels
>DevelopMentor
>http://staff.develop.com/ktegels/
>
I was thinking that I might change the process to be batch instead of
"real-time".
What I mean by that is...
Instead of taking each file and pushing that file into the database
one at a time (200,000+ time a day) I would put the data from about
1000 files into one file and then do a batch import into a scratch
table. Then I could use a triger (maybe) to move that data from the
scratch table into the real table.
Or maybe what you suggested about SSIS and I could use the same flat
file to move the data in using SSIS every 10 minutes or so.
Any comments on these thoughts? If I could batch the same data how
much of a performance increase could I get out of that if any?
| |
| Michael Rys [MSFT] 2006-01-12, 8:24 pm |
| You can improve your nodes method quite a bit by not using query() methods
but value() methods and avoid the parent axis:
select
p.Row. value('(DataForParen
t/text())[1]', 'varchar(100)'),
c.Row. value('(ChildDataOne
/text())& #91;1]','varchar(208
3)'),
c.Row. value('(ChildDataTwo
/text())[1]'), 'varchar(1000)')
from
@xmlDoc.nodes('(/RootElement/ParentElements/ParentElement) as p(Row)
cross apply p.Row. nodes('ChildElement)
')
AS c(Row)
Should be more efficient.
However, your hand coded TVF is most likely the most performant, but not
necessarily best scaling, and certainly the hardest to maintain solution.
Most performance because you special case the processing to the XML format
at hand.
Best regards
Michael
"Ryan Pedersen" <pedersenryan@gmail.com> wrote in message
news:p1b8s110vkgj2gb
529fnr8v2ai3k6gncps@
4ax.com...
>I need your help.
>
> I have stored procedure that takes in an xml document and then inserts
> data from that document into a single table. This stored procedure
> needs to execute over 200,000 times a day.
>
> Here is a snap shot of what the data looks like... Real data has about
> 100 instances of Parent Element per xml document and about 2 - 3 Child
> Elements per Parent Element. There is also usually one Parent element
> that will have about 50 - 80 Child Elements.
>
> <RootElement>
> <ParentElements>
> <ParentElement>
> <DataForParent>abc</DataForParent>
> <ChildElement>
> <ChildDataOne>first</ChildDataOne>
> <ChildDataTwo>asdf</ChildDataTwo>
> </ChildElement>
> <ChildElement>
> <ChildDataOne>two</ChildDataOne>
> <ChildDataTwo>fdsafasdf</ChildDataTwo>
> </ChildElement>
> </ParentElement>
> <ParentElement>
> <DataForParent>xyz</DataForParent>
> <ChildElement>
> <ChildDataOne>three</ChildDataOne>
> <ChildDataTwo>lkj lkj</ChildDataTwo>
> </ChildElement>
> <ChildElement>
> <ChildDataOne>four</ChildDataOne>
> <ChildDataTwo>Data Here</ChildDataTwo>
> </ChildElement>
> </ParentElement>
> </ParentElements>
> </RootElement>
>
> So I have tried a few things with different results. Using the
> execution plan to measure performance it looks like option C is the
> fastest with an estimated operator cost of 2.645.
>
> Option A:
> First I used xquery to transform the data into a table.
> Estimated Operator Cost = 330.365
>
> select
> cast(p.Row.query('../DataForParent/text()') as varchar(100)),
> cast(p.Row.query(ChildDataOne/text()') as varchar(2083)),
> cast(p.Row.query(ChildDataTwo/text()') as varchar(1000))
> from
> @xmlDoc.nodes('(/RootElement/ParentElements/ParentElement/ChildElement)')
> AS p(Row)
>
> Option B:
> Option A performed so slowly that I went back to the drawing board and
> tried to do the same thing using OPENXML.
> Estimated Operator Cost = 17.245
>
> DECLARE @iy int
>
> exec sp_xml_preparedocume
nt @iy OUTPUT, @xmlDoc
>
> SELECT
> *
> FROM
> OPENXML(@iy, '/RootElement/ParentElements/ParentElement/ChildElement,
> 2)
> WITH (parentdata varchar(100) '../DataForParent/text()',
> dataone varchar(2083) ChildDataOne/text()',
> datatwo varchar(1000) ChildDataTwo/text()')
>
> Option C:
> Finally I wrote my own User Defined Table Value Function to parse the
> xml and return a table and believe it or not it is the fastest.
> Estimated Operator Cost = 2.745
>
>
>
> So what can I do to better the performance of this thing? I haven't
> tried bulk loading the xml because I actually need to work on the data
> before I just slam it into the database but I could engineer a process
> if bulk loading the xml would drive the estimated operation cost down
> into the 0.01 or 0.001 range.
>
> Any thoughts or comments that you have are very welcome. Good, bad and
> the ugly... bring it on.
>
> Thanks
|
|
|
|
|