Arrays and IDs

New to FreeBASIC? Post your questions here.
Post Reply
Schnapsidee
Posts: 10
Joined: Jul 07, 2017 20:20

Arrays and IDs

Post by Schnapsidee »

Hey :)

I have a little problem. I am loading a big xml-file (about 4 GB) and inside there are some elements, which refer to other objects. Like this:

Code: Select all

<object1 id="123712346"/>
<object1 id="99174z417"/>

<object2>
	<nd ref="123712346"/>
</object2>
I have two UDTs. One for the object1, another one for object2:

Code: Select all

Type Object1
	As UInteger id
End Type

Type Object2
	As UInteger ids(Any)
End Type

' .... Parsing

For i As Integer=1 To number_of_object1
	If Object2.ids(index)=Object1(i).id Then 
		Object2.ids(index)=i
		Exit For
	End If
Next
Loading this is horribly slow. Every for-loop takes minutes (I got a few million object1s). Is there a faster way maybe? :)
Last edited by Schnapsidee on Jul 23, 2017 13:18, edited 2 times in total.
fxm
Moderator
Posts: 12081
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Arrays and IDs

Post by fxm »

Be more rigorous in your code snippet because "id" is defined as an Integer and "ids" as an Integer array in the Types, so the line syntax in the For Loop is not compatible.
dodicat
Posts: 7976
Joined: Jan 10, 2006 20:30
Location: Scotland

Re: Arrays and IDs

Post by dodicat »

Adding some values to your code, the actual looping for about 41 million objects (without an exit) is half a second here with the 64 bit compiler,and about .2 seconds with -gen gas (32 bit compiler)
If I go about 41 million X ten, then Win 10 seems to into pagefile mode here and it slows down greatly.

Code: Select all

 Type Object1
   As UInteger id
End Type

Type Object2
   As UInteger ids(Any)
End Type


dim as ulongint tot=40*1024*1024
print tot

redim as object1 object1(tot)
dim as object2 object2

redim object2.ids(tot)
for n as ulongint=1 to tot
    object1(n).id=123456 'set an initial value
    object2.ids(n)=123456
next

print "start"


' .... Parsing
var  number_of_object1=tot,index=0
dim as double t=timer
For i As ulongint=1 To number_of_object1
    index=i
   If Object2.ids(index)=Object1(i).id Then 
      Object2.ids(index)=i
      'Exit For
   End If
Next
print index
print "Time taken "; timer-t
print "A sample"
print Object2.ids(20000000)
sleep 
Schnapsidee
Posts: 10
Joined: Jul 07, 2017 20:20

Re: Arrays and IDs

Post by Schnapsidee »

Hmm, I guess the example wasn't that good. This is the actual code I am using:

Code: Select all

Type TNode 
	As Single lat,lon
	As UInteger id
End Type

Type TStreet
	As UInteger nodeid(Any)
	As UInteger typ			
	As Integer num_nodes
End Type

Dim Shared As TNode Node(Any)
Dim Shared As Integer num_nodes

Dim Shared As TStreet Street(Any)
Dim Shared As Integer num_streets

ff=FreeFile
Open file For Input As #ff
	Do
		Line Input #ff,lin
		
		If Instr(lin,"<node id=") Then
			tmpid=Val(Mid(lin,Instr(lin,"id=")+4,Instr(lin,"lat=")-6-Instr(lin,"id=")))
			tmplat=Val(Mid(lin,Instr(lin,"lat=")+5,Instr(lin,"lon=")-7-Instr(lin,"lat=")))
			tmplon=Val(Mid(lin,Instr(lin,"lon=")+5,Instr(lin,"version=")-7-Instr(lin,"lon=")))
			
			num_nodes+=1
			Redim Preserve Node(1 To num_nodes)
			Node(.num_nodes).lat=0-tmplat
			Node(.num_nodes).lon=tmplon
			Node(.num_nodes).id=tmpid
		End If
		
		'###############################################
		
		If Instr(lin,"<way id=") Then
			num_streets+=1
			Redim Preserve Street(1 To num_streets)
			With Street(num_streets)
				Do
					Line Input #ff,lin
					If Instr(lin,"<nd ref=") Then
						.num_nodes+=1
						Redim Preserve .nodeid(0 To .num_nodes)
						.nodeid(.num_nodes)=Val(Mid(lin,Instr(lin,"<nd ref=")+9,Instr(lin,"/>")-10-Instr(lin,"<nd ref=")))
						For i As Integer=1 To num_nodes
							If Node(i).id=.nodeid(.num_nodes) Then
								.nodeid(.num_nodes)=i
								Exit For
							End If
						Next
					Elseif Instr(lin,"highway") And Instr(lin,"residential") Then
						.typ=1
					End If
				Loop Until Instr(lin,"</way>")
			End With
		End If	
		
	Loop Until Eof(ff)
Close #ff
The important part comes after the ##. "Node" would be object1 and "Street" object2.
I have about 4 million nodes and about 5000 streets, every street has a reference to 2-8 nodes. I don't have a better idea of looping through 4 million nodes to find the right ones.
fxm
Moderator
Posts: 12081
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Arrays and IDs

Post by fxm »

The most disadvantageous for the execution time is the resizing of arrays at each iteration of the loop.
One can resizing the arrays not at each iteration of the loop, but sometimes only (with a big increment), and at the end adjust the final size of the arrays if necessary.

- Example of array resizing at each loop iteration:

Code: Select all

Dim As Integer array(Any)
Dim As Integer count
Dim As Double t
Dim As Integer Imax = 1000000

t = Timer
For I As Integer = 1 To Imax
  count += 1
  Redim Preserve array(1 to count)
  array(count) = I
Next I
Print Cast(Single, Timer - t);" s"
Print

For I As Integer = 1 To 5
  Print array(I)
Next I
Print " ..."
For I As Integer = Imax-5 To Imax
  Print array(I)
Next I

Sleep

Code: Select all

 1.239143042237174 s

 1
 2
 3
 4
 5
 ...
 999995
 999996
 999997
 999998
 999999
 1000000
- Example of array resizing sometimes only (but with a big increment):

Code: Select all

Dim As Integer array(Any)
Dim As Integer count
Dim As Double t
Dim As Integer Imax = 1000000

t = Timer
For I As Integer = 1 To Imax
  count += 1
  If Ubound(array) < count Then
    Redim Preserve array(1 to Ubound(array) + 100000)
  End If
  array(count) = I
Next I
Redim Preserve array(1 to count)
Print Cast(Single, Timer - t);" s"
Print

For I As Integer = 1 To 5
  Print array(I)
Next I
Print " ..."
For I As Integer = Imax-5 To Imax
  Print array(I)
Next I

Sleep

Code: Select all

 0.02491652610692441 s

 1
 2
 3
 4
 5
 ...
 999995
 999996
 999997
 999998
 999999
 1000000
Last edited by fxm on Aug 07, 2018 6:07, edited 2 times in total.
Schnapsidee
Posts: 10
Joined: Jul 07, 2017 20:20

Re: Arrays and IDs

Post by Schnapsidee »

It's a little bit faster, but not really much. Before, it took like 15 minutes to load, now it's 14 minutes. It's the for-loop to find the right Node, that makes it slow. But thank you for the tip. :)
fxm
Moderator
Posts: 12081
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Arrays and IDs

Post by fxm »

Have you applied it to all your arrays?

The choice of the increment step value is important (it is not a linear improvement).
For my example (1 million of iterations):
10 => 1.135 s
50 => 1.158 s
200 => 1.166 s
1000 => 1.139 s
5000 => 0.251 s
20000 => 0.075 s
50000 => 0.035 s
100000 => 0.026 s
500000 => 0.015 s
5000000 => 0.011 s
Choose the largest but reasonable increment.
fxm
Moderator
Posts: 12081
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Arrays and IDs

Post by fxm »

Code: Select all

.....
         Dim As Integer pid = Instr(lin,"id=")
         Dim As Integer plat = Instr(lin,"lat=")
         Dim As Integer plon = Instr(lin,"lon=")
         tmpid=Val(Mid(lin,pid+4,plat-6-pid))
         tmplat=Val(Mid(lin,plat+5,plon-7-plat))
         tmplon=Val(Mid(lin,plon+5,Instr(lin,"version=")-7-plon))
.....
                  Dim As Integer pndref = Instr(lin,"<nd ref=")
                  .nodeid(.num_nodes)=Val(Mid(lin,pndref+9,Instr(lin,"/>")-10-pndref))
.....
Schnapsidee
Posts: 10
Joined: Jul 07, 2017 20:20

Re: Arrays and IDs

Post by Schnapsidee »

fxm, I measured the time it needs for the the for-loops. Yes, it is those loops that's slowing it down.
fxm wrote: Dim As Integer pndref = Instr(lin,"<nd ref=")
.nodeid(.num_nodes)=Val(Mid(lin,pndref+9,Instr(lin,"/>")-10-pndref))
Yes, those are the IDs of the nodes I need. The for-loop is searching the Node with these IDs.

Code: Select all

.nodeid(.num_nodes)=Val(Mid(lin,Instr(lin,"<nd ref=")+9,Instr(lin,"/>")-10-Instr(lin,"<nd ref=")))
For i As Integer=1 To num_nodes
	If Node(i).id=.nodeid(.num_nodes) Then
		.nodeid(.num_nodes)=i
		Exit For
	End If
Next
And it's horribly slow.
St_W
Posts: 1618
Joined: Feb 11, 2009 14:24
Location: Austria
Contact:

Re: Arrays and IDs

Post by St_W »

You could use some sort of hash table.
sancho2
Posts: 547
Joined: May 17, 2015 6:41

Re: Arrays and IDs

Post by sancho2 »

Could you post a better example of the xml file?
I can't understand your code as it doesn't jive with the xml in the first post.
Schnapsidee
Posts: 10
Joined: Jul 07, 2017 20:20

Re: Arrays and IDs

Post by Schnapsidee »

Sure

Code: Select all

<bounds minlat="51.2234" minlon="9.4702" maxlat="51.3466" maxlon="9.8164002"/>
<node id="299975079" lat="51.2245452" lon="9.4803247" version="1"/>
<node id="492242" lat="51.2252" lon="9.4808962" version="1"/>
<node id="937604" lat="51.2274578" lon="9.4833375" version="1"/>
<node id="299975195" lat="51.2422024" lon="9.5053731" version="1"/>
<node id="492270" lat="51.2443431" lon="9.5092192" version="1"/>
<node id="937603" lat="51.2485041" lon="9.5158835" version="1"/>
<node id="301316609" lat="51.2620839" lon="9.5168436" version="1">

<way id="2732604" version="1">
	<nd ref="299975079"/>
	<nd ref="301316609"/>
	<nd ref="299975195"/>
</way>
<way id="4003654" version="1">
	<nd ref="937604"/>
	<nd ref="937603"/>
</way>
sancho2
Posts: 547
Joined: May 17, 2015 6:41

Re: Arrays and IDs

Post by sancho2 »

A simple ordered list might speed up the for loop
You can try to split the nodes into a 2 dimension array, using the first digit of the id as the index of the first dimension.
Then you redim preserve the second dimension for the node count.

Code: Select all

Dim Shared As TNode Node(Any, Any)
...
Redim Preserve Node(0 to 9, 1 To num_nodes)
So in the test data you posted the first node '<node id="299975079"' would be stored in the array Node(2, 1) and the next node '<node id="492242"' would be at Node(4,1) etc.
Unless the id's begin predominately with only a couple of number, this will dramatically reduce the number of elements you have to search for in the for loop.

I also notice that you store an index to the matched node. You might consider storing a pointer to that node instead.

Also, FXM is showing you that the number of instr()'s inside the for loop can be reduced by using a variable.
Post Reply