bitarray

vdecampo
Posts: 2982
Joined: Aug 07, 2007 23:20
Location: Maryland, USA
Contact:

Postby vdecampo » May 09, 2011 13:09

TJF wrote:
vdecampo wrote:1000 Integer flags is only 1k of memory. If you are rendering a scene I would say it is even more important to deal in Integers for speed. Even with 100,000 flags I would still use Integer.

-Vince

On my system, 1000 INTEGERs take about 4k of memory.

In a 64-bit app, 100,000 flags stored as INTEGERs will take nearly 1 MB, instead of the needed 12.5 kB.


Sorry, I forgot to multiply by 4! Still, as an extreme case, 1 MB out of a minimum of 4 GB is nothing: only about 0.025% of total RAM. I vote for speed. If I were writing an app to fly on a spacecraft where RAM is scarce, then maybe I would opt for bits.

-Vince
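
For concreteness, the sizes being argued about can be checked with a few lines of FreeBASIC. This is only a quick sketch; it assumes the usual 4-byte INTEGER on 32-bit builds and 8-byte INTEGER on 64-bit builds, and uses sizeof so neither is hard-coded.

Code: Select all

dim as longint nflags = 100000
'' one INTEGER per flag (4 or 8 bytes each, depending on the target)
print "as INTEGERs: "; nflags * sizeof(integer); " bytes"
'' one bit per flag, rounded up to whole bytes
print "bit-packed:  "; (nflags + 7) \ 8; " bytes"
sleep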
marcov
Posts: 2939
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL
Contact:

Postby marcov » May 09, 2011 14:09

MichaelW wrote:
vdecampo wrote:I personally just use Integers to store flags. Memory is plentiful and the CPU works with Integers faster anyway.

Same here. Even if you ignore all the memory overhead built into the OS (or at least into Windows), unless you have a very large number of flags the memory savings from using single-bit flags are negligible.


It's not just memory savings. When taking copies, the bitpacked arrays are also faster (fewer bytes copied is faster, and so is the initial initialization).

Moreover, I don't think the integer instructions are that much faster (since most accesses will be cache bound anyway). You might never earn back the slower initialization.
TJF
Posts: 3546
Joined: Dec 06, 2009 22:27
Location: N47°, E15°
Contact:

Postby TJF » May 09, 2011 20:35

Not every developer is writing games, and not every computer has 4 GB.

Think of daemons running in the background. Think of DOS users, limited to a few hundred kB.

Whether you like them or not, bit arrays are useful.
vdecampo
Posts: 2982
Joined: Aug 07, 2007 23:20
Location: Maryland, USA
Contact:

Postby vdecampo » May 09, 2011 22:02

TJF wrote:Not every developer is writing games, and not every computer has 4 GB.

Think of daemons running in the background. Think of DOS users, limited to a few hundred kB.

Whether you like them or not, bit arrays are useful.


That response was based on the 64-bit OS comment. Yes, if memory is scarce, bit operations are necessary. Clever code is not always the best fit for a situation. It was for the situation this bitarray was suggested for that I put forward the Integer option.

-Vince
MichaelW
Posts: 3500
Joined: May 16, 2006 22:34
Location: USA

Postby MichaelW » May 10, 2011 4:17

marcov wrote:It's not just memory savings. When taking copies, the bitpacked arrays are also faster (fewer bytes copied is faster, and so is the initial initialization).

The copies will be much faster. For the initial initialization it depends on how you do it. Doing it as a copy operation will be much faster, but doing it through the bit access mechanism will be slower. It did occur to me that for random access, the more compact bit array might have a significant speed advantage.
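
As a sketch of the "initialization as a copy/fill operation" point: with the packed representation, all flags can be cleared or set in one block operation instead of bit by bit. This assumes a plain ubyte buffer like the ones used later in the thread, and memset from the C runtime headers; the helper names are made up for illustration.

Code: Select all

#include "crt/string.bi"   '' for memset

'' clear every flag in one block operation
sub flags_clear_all( byval p as ubyte ptr, byval nflags as integer )
    memset( p, 0, (nflags + 7) \ 8 )
end sub

'' set every flag in one block operation
'' (may also set a few unused padding bits in the last byte)
sub flags_set_all( byval p as ubyte ptr, byval nflags as integer )
    memset( p, &hFF, (nflags + 7) \ 8 )
end sub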
marcov
Posts: 2939
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL
Contact:

Postby marcov » May 10, 2011 7:30

MichaelW wrote:
marcov wrote:It's not just memory savings. When taking copies, the bitpacked arrays are also faster (fewer bytes copied is faster, and so is the initial initialization).

The copies will be much faster. For the initial initialization it depends on how you do it. Doing it as a copy operation will be much faster,


The bitarray will be copied using memcpy in the same way as the integer one; it will just have 1/32 of the bytes. Note that you always have some such whole-array operations (initialization, filling with zeroes, testing whether the set contains any elements).

but doing it through the bit access mechanism will be slower.


I'm not sure that is measurable. I already showed above that bitarray accesses can also be one- and two-instruction sequences, though more expensive ones (bt* instead of mov).

It did occur to me that for random access, the more compact bit array might have a significant speed advantage.


There are cache-related aspects there too, yes. But the impact is hard to predict and probably depends a lot on usage. That is also why naive benchmarking, like putting such an instruction in a 10-million-iteration loop, is pretty useless here: such benchmarks often ignore the fact that include/exclude operations are not the only operations, and they allow the array to become fully cached after a few ops.

As a rule of thumb, a bigger data size is more expensive (e.g. 32-bit apps become about 5-10% slower just by recompiling them for 64-bit, if all other circumstances are kept the same(*); very data-intensive apps even more so).

Still, that is not exactly the same as this case, since this involves different instruction selection (and the tradeoffs of that are different for every processor generation). I haven't benchmarked such instructions myself since 486 times, so while I lean towards the bitarray solution being better overall, that is mostly based on the cheaper init, copy and pass-by-value, and I assume the base ops are in the same category (cache effects in elementary operations vs more expensive instructions average out in typical usage).

Now that I think of it, testing whether a set is zero is another such special complex operation that happens often enough to be worth optimizing (e.g. with a repne scasb/scasd).
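
A minimal FreeBASIC sketch of such an "is the set empty?" test, illustrative only: it scans whole 32-bit words rather than using repne scas, and it assumes the buffer size is a multiple of 4 bytes (which the callocate sizes in this thread happen to be).

Code: Select all

'' returns non-zero if any flag in the packed array is set
function flags_any_set( byval p as ulong ptr, byval nflags as integer ) as integer
    for w as integer = 0 to ((nflags + 31) \ 32) - 1
        if p[w] <> 0 then return -1
    next
    return 0
end function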

I use (relatively small) bitsets in nearly all business code, and I know that there they are much faster as bit operations (since they are then registerable and easily passed to other procedures).

Pascal bitsets have long been limited to 256 elements; bigger ones are based on a class wrapper around an array of <native word size> (bitpacked, but always a multiple of the word size). I still hope I'll get native sets > 256 one day (but probably only if I do it myself).

The main reason for having a bigger native set is simply that when you use an enum as e.g. a state, and use sets of the enum as masks, you have to rewrite code as soon as the enum hits 256 elements. That is an annoying limitation.

(*) One has to be very careful interpreting benchmarks here. The minimum SSE2 level of AMD64 heavily accelerates some procedures that are often rate-determining. While you can often simply turn on SSE2 for the compiled code, you can't make hand-coded routines in the runtime change.
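
For readers unfamiliar with Pascal sets: a rough FreeBASIC analogue of "a set of an enum used as a mask" is an enum of single-bit values combined into an integer. Purely illustrative, with made-up names; the point is that the scheme runs out of room at the width of the mask integer, the same kind of hard limit marcov describes.

Code: Select all

enum JobState
    st_queued  = 1 shl 0
    st_running = 1 shl 1
    st_failed  = 1 shl 2
    st_done    = 1 shl 3
end enum

dim as uinteger mask = st_queued or st_running    '' a "set" of two states
if mask and st_running then print "running is in the set"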
MichaelW
Posts: 3500
Joined: May 16, 2006 22:34
Location: USA

Postby MichaelW » May 11, 2011 3:32

This is an attempt to compare the bit array access time to the integer array access time, for sequential access and random access.

Code: Select all

'==============================================================================
#include "windows.bi"
'==============================================================================

'' return the bit at index as 0 or -1 (FreeBASIC true)
function bit_get( p as byte ptr, index as integer ) as integer
    return ((p[index shr 3]) and (1 shl (index and 7))) <> 0
end function

'' set the bit at index to 1
sub bit_set( p as byte ptr, index as integer )
    p[index shr 3] or= (1 shl (index and 7))
end sub

'' clear the bit at index to 0 (and-not instead of xor, so the result
'' is correct even if the bit is already clear)
sub bit_clear( p as byte ptr, index as integer )
    p[index shr 3] and= not (1 shl (index and 7))
end sub

'==============================================================================

function int_get( p as integer ptr, index as integer ) as integer
    return p[index] <> 0
end function

sub int_set( p as integer ptr, index as integer )
    p[index] = 1
end sub

sub int_clear( p as integer ptr, index as integer )
    p[index] = 0
end sub

'==============================================================================


dim as double t1, t2, bit_time, int_time
dim as integer random_index, nflags = 100000
'' the random-access test grows nflags up to 32 * 100000 = 3,200,000 flags,
'' so both buffers are allocated for that largest size up front
dim as byte ptr pb = callocate( (nflags * 32) \ 8 )
dim as integer ptr pi = callocate( (nflags * 32) * 4 )

SetPriorityClass( GetCurrentProcess(), HIGH_PRIORITY_CLASS )

sleep 3000

print "sequential access:"
print " nflags  ";nflags

t1 = timer
for i as integer = 0 to nflags - 1
    bit_set( pb, i )
    bit_get( pb, i )
    bit_clear( pb, i )
    bit_get( pb, i )
next
t2 = timer
bit_time = t2 - t1
print using " bit_time ##.###s"; bit_time

t1 = timer
for i as integer = 0 to nflags - 1
    int_set( pi, i )
    int_get( pi, i )
    int_clear( pi, i )
    int_get( pi, i )
next
t2 = timer
int_time = t2 - t1
print using " int_time ##.###s"; int_time
print
print "random access:"
print " nflags   bit_time  int_time   bit_time/int_time"
for i as integer = 1 to 6
    t1 = timer
    for j as integer = 0 to nflags - 1
        random_index = rnd * nflags
        bit_set( pb, random_index )
        bit_get( pb, random_index )
        random_index = rnd * nflags
        bit_clear( pb, random_index )
        bit_get( pb, random_index )
    next
    t2 = timer
    bit_time = t2 - t1
    t1 = timer
    for j as integer = 0 to nflags - 1
        random_index = rnd * nflags
        int_set( pi, random_index )
        int_get( pi, random_index )
        random_index = rnd * nflags
        int_clear( pi, random_index )
        int_get( pi, random_index )
    next
    t2 = timer
    int_time = t2 - t1
    print nflags;
    print using "  ##.###s"; bit_time,
    print using "    ##.###s"; int_time,
    print using "        ##.###"; bit_time / int_time
    nflags shl= 1
next

sleep

Code: Select all

sequential access:
 nflags   100000
 bit_time  0.047s
 int_time  0.022s

random access:
 nflags   bit_time  int_time   bit_time/int_time
 100000   0.077s     0.061s         1.267
 200000   0.162s     0.150s         1.086
 400000   0.336s     0.344s         0.979
 800000   0.684s     0.730s         0.937
 1600000   1.383s     1.444s         0.958
 3200000   2.816s     2.725s         1.033
marcov
Posts: 2939
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL
Contact:

Postby marcov » May 11, 2011 19:21

MichaelW wrote:This is an attempt to compare the bit array access time to the integer array access time, for sequential access and random access.
function bit_get( p as byte ptr, index as integer ) as integer
return ((p[index shr 3]) and (1 shl (index and 7))) <> 0
end function


Thanks. That at least gives some reference point. I note that the bitarray is already quite close even with a very suboptimal implementation.

It would probably need some intrinsics implemented in the compiler that translate to the bt* instructions on e.g. x86.
MichaelW
Posts: 3500
Joined: May 16, 2006 22:34
Location: USA

Postby MichaelW » May 12, 2011 0:15

I think this is now about the fourth revision. I decided that returning the value of a bit as 0 or -1 is not appropriate, and changed it to 0 or 1.

Code: Select all

'==============================================================================
#include "windows.bi"
#include "counter.bas"
'==============================================================================
'' Counter.bas is available here:
''
'' http://www.freebasic.net/forum/viewtopic.php?t=4221#4221
'==============================================================================

'' Naked routines: the arguments are read directly from the stack
'' ([esp+4] = buffer pointer, [esp+8] = bit index) and the callee pops
'' them with ret 8. BT* takes the bit offset in a register, so no
'' manual shr/and is needed.
sub bit_set naked( p as ubyte ptr, index as integer )
    asm
        mov edx, [esp+4]
        mov eax, [esp+8]
        bts DWORD PTR [edx], eax
        ret 8
    end asm
end sub

sub bit_reset naked( p as ubyte ptr, index as integer )
    asm
        mov edx, [esp+4]
        mov eax, [esp+8]
        btr DWORD PTR [edx], eax
        ret 8
    end asm
end sub

function bit_get naked( p as ubyte ptr, index as integer ) as integer
    asm
        mov edx, [esp+4]
        mov ecx, [esp+8]
        xor eax, eax
        bt  DWORD PTR [edx], ecx
        adc eax, 0
        ret 8
    end asm
end function

'==============================================================================

sub int_set naked( p as integer ptr, index as integer )
    asm
        mov edx, [esp+4]
        mov eax, [esp+8]
        mov DWORD PTR [edx+eax*4], 1
        ret 8
    end asm
end sub

sub int_reset naked( p as integer ptr, index as integer )
    asm
        mov edx, [esp+4]
        mov eax, [esp+8]
        mov DWORD PTR [edx+eax*4], 0
        ret 8
    end asm
end sub

function int_get naked( p as integer ptr, index as integer ) as integer
    asm
        mov edx, [esp+4]
        mov ecx, [esp+8]
        mov eax, [edx+ecx*4]
        ret 8
    end asm
end function

'==============================================================================

'-----------------------------------------------------------
'' 100000000 flags each: 12.5 MB bit-packed, 400 MB as
'' 32-bit integers.
'-----------------------------------------------------------

dim as ubyte ptr pb = callocate(100000000 \ 8)
dim as integer ptr pi = callocate(100000000 * 4)
dim as integer r

'==============================================================================

'' sanity check: set/clear the even and odd bits of every byte, verifying
'' the result both through bit_get and against the raw byte value
for i as integer = 0 to 100000000-1 step 8
    bit_set( pb, i + 0 )
    bit_set( pb, i + 2 )
    bit_set( pb, i + 4 )
    bit_set( pb, i + 6 )
    if pb[i shr 3] <> &b01010101 then print "error"
    if bit_get( pb, i + 0 ) <> 1 then print "error"
    if bit_get( pb, i + 2 ) <> 1 then print "error"
    if bit_get( pb, i + 4 ) <> 1 then print "error"
    if bit_get( pb, i + 6 ) <> 1 then print "error"
    bit_reset( pb, i + 0 )
    bit_reset( pb, i + 2 )
    bit_reset( pb, i + 4 )
    bit_reset( pb, i + 6 )
    if pb[i shr 3] <> 0 then print "error"
    if bit_get( pb, i + 0 ) <> 0 then print "error"
    if bit_get( pb, i + 2 ) <> 0 then print "error"
    if bit_get( pb, i + 4 ) <> 0 then print "error"
    if bit_get( pb, i + 6 ) <> 0 then print "error"
    bit_set( pb, i + 1 )
    bit_set( pb, i + 3 )
    bit_set( pb, i + 5 )
    bit_set( pb, i + 7 )
    if pb[i shr 3] <> &b10101010 then print "error"
    if bit_get( pb, i + 1 ) <> 1 then print "error"
    if bit_get( pb, i + 3 ) <> 1 then print "error"
    if bit_get( pb, i + 5 ) <> 1 then print "error"
    if bit_get( pb, i + 7 ) <> 1 then print "error"
next
print

'==============================================================================

sleep 3000

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    bit_set( pb, 0 )
counter_end
print "bit_set", counter_cycles; " cycles first"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    bit_reset( pb, 0 )
counter_end
print "bit_reset", counter_cycles; " cycles first"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    bit_get( pb, 0 )
counter_end
print "bit_get", counter_cycles; " cycles first"
print

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    bit_set( pb, 999999 )
counter_end
print "bit_set", counter_cycles; " cycles last"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    bit_reset( pb, 999999 )
counter_end
print "bit_reset", counter_cycles; " cycles last"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    bit_get( pb, 999999 )
counter_end
print "bit_get", counter_cycles; " cycles last"
print

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    int_set( pi, 0 )
counter_end
print "int_set", counter_cycles; " cycles first"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    int_reset( pi, 0 )
counter_end
print "int_reset", counter_cycles; " cycles first"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    int_get( pi, 0 )
counter_end
print "int_get", counter_cycles; " cycles first"
print

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    int_set( pi, 999999 )
counter_end
print "int_set", counter_cycles; " cycles last"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    int_reset( pi, 999999 )
counter_end
print "int_reset", counter_cycles; " cycles last"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    int_get( pi, 999999 )
counter_end
print "int_get", counter_cycles; " cycles last"
print

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    asm
        mov edx, [pb]
        mov eax, 999999
        bts DWORD PTR [edx], eax
    end asm
counter_end
print "bit_set", counter_cycles; " cycles inline"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    asm
        mov edx, [pb]
        mov eax, 999999
        btr DWORD PTR [edx], eax
    end asm
counter_end
print "bit_reset", counter_cycles; " cycles inline"

counter_begin( 1000, HIGH_PRIORITY_CLASS )
    asm
        mov edx, [pb]
        mov ecx, 999999
        xor eax, eax
        bt  DWORD PTR [edx], ecx
        sbb eax, 0
        mov [r], eax
    end asm
counter_end
print "bit_get", counter_cycles; " cycles inline"
print

'==============================================================================

dim as double t1, t2, bit_time, int_time
dim as integer random_index, nflags = 100000

SetPriorityClass( GetCurrentProcess(), HIGH_PRIORITY_CLASS )

print "sequential access:"
print " nflags  ";nflags

t1 = timer
for i as integer = 0 to nflags - 1
    bit_set( pb, i )
    bit_get( pb, i )
    bit_reset( pb, i )
    bit_get( pb, i )
next
t2 = timer
bit_time = t2 - t1
print " bit_time";
print using " ##.###s"; bit_time

t1 = timer
for i as integer = 0 to nflags - 1
    int_set( pi, i )
    int_get( pi, i )
    int_reset( pi, i )
    int_get( pi, i )
next
t2 = timer
int_time = t2 - t1
print " int_time";
print using " ##.###s"; int_time
print
print "random access:"
print " nflags   bit_time  int_time   bit_time/int_time"
for i as integer = 1 to 4
    t1 = timer
    for j as integer = 0 to nflags - 1
        random_index = rnd * nflags
        bit_set( pb, random_index )
        bit_get( pb, random_index )
        random_index = rnd * nflags
        bit_reset( pb, random_index )
        bit_get( pb, random_index )
    next
    t2 = timer
    bit_time = t2 - t1
    t1 = timer
    for j as integer = 0 to nflags - 1
        random_index = rnd * nflags
        int_set( pi, random_index )
        int_get( pi, random_index )
        random_index = rnd * nflags
        int_reset( pi, random_index )
        int_get( pi, random_index )
    next
    t2 = timer
    int_time = t2 - t1
    print nflags;
    print using "  ##.###s"; bit_time,
    print using "    ##.###s"; int_time,
    print using "        ##.###s"; bit_time / int_time
    nflags shl= 1
next

sleep

Running on a P3:

Code: Select all

bit_set       15 cycles first
bit_reset     15 cycles first
bit_get       15 cycles first

bit_set       16 cycles last
bit_reset     16 cycles last
bit_get       16 cycles last

int_set       15 cycles first
int_reset     15 cycles first
int_get       9 cycles first

int_set       14 cycles last
int_reset     14 cycles last
int_get       10 cycles last

bit_set       6 cycles inline
bit_reset     6 cycles inline
bit_get       9 cycles inline

sequential access:
 nflags   100000
 bit_time  0.011s
 int_time  0.010s

random access:
 nflags   bit_time  int_time   bit_time/int_time
 100000   0.044s     0.053s         0.838
 200000   0.092s     0.132s         0.694
 400000   0.188s     0.311s         0.605
 800000   0.380s     0.670s         0.567

The bit test instructions are slower than the fast (typically < 1 cycle) instructions, but still relatively fast considering what they do and the number of instructions that they can replace. I expected them to be slower than they are, given that they operate on the carry flag. I would be interested to see the cycle counts for some recent processors.

With sequential access the integer array code is slightly faster, but I now think not enough faster to justify a 32x increase in memory usage. For random access on relatively large arrays, with the effect that has on caching, there is no contest: access to the smaller bit array is much faster.
marcov
Posts: 2939
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL
Contact:

Postby marcov » May 18, 2011 12:33

MichaelW wrote:I think this is now about the fourth revision. I decided that returning the value of a bit as 0 or -1 is not appropriate, and changed it to 0 or 1.


Then you can use the 386+ SETcc instruction (setc, iirc) to convert the carry flag into a register value.

setc %al

instead of the adc trick.
MichaelW
Posts: 3500
Joined: May 16, 2006 22:34
Location: USA

Postby MichaelW » May 18, 2011 14:33

marcov wrote:Then you can use the 386+ SETcc instruction (setc, iirc) to convert the carry flag into a register value.

setc %al

instead of the adc trick.

Thanks for the suggestion. I normally don't even try the SETcc instructions because I have never seen an instance where they made my code faster, and I have seen many instances where avoiding partial registers did. Running on a P3, changing the bit_get function to this:

Code: Select all

function bit_get naked( p as ubyte ptr, index as integer ) as integer
    asm
        mov edx, [esp+4]
        mov ecx, [esp+8]
        xor eax, eax
        bt  DWORD PTR [edx], ecx
        setc al
        ret 8
    end asm
end function

Did not change the cycle count. Agner Fog does recommend using SETcc to replace LAHF on a P4.
1000101
Posts: 2556
Joined: Jun 13, 2005 23:14
Location: SK, Canada

Postby 1000101 » May 20, 2011 23:12

TJF wrote:Macros blow up the binaries if you use them a lot.
Depends on how they're used. Most of the time the macros resolve to very few instructions.

TJF wrote:They need global variables. That's why they're unsafe in big projects.
Simply false. They require global variables no more than your implementation does.

TJF wrote:I haven't seen a macro solution using an index and being prepared to handle more than 64 entries.
Then you didn't look very hard. My solution can handle up to the limits of the memory installed in the system and the limits of the system addressing.

TJF wrote:I haven't seen a macro solution allowing to redim the bit array.
That's because they all leave handling of the memory to the user.

TJF wrote:Mine is a clean OOP solution without these restrictions.
None of the "restrictions" you claim are valid or true. Yours is simply another implementation and that is all you can claim.



btw, 1000 integers on a 64-bit system is only 8000 bytes. Nowhere near 1 MB.
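
Neither macro implementation under discussion is actually posted in this thread. As a rough illustration only, a bare-bones macro version of the byte-buffer approach (no bounds checking, memory management left to the caller, made-up names) could look like this:

Code: Select all

'' index-based access to a ubyte buffer, one bit per flag
#define BITARRAY_BYTES( n )   (((n) + 7) \ 8)
#define BIT_GET( p, i )       ((((p)[(i) shr 3]) shr ((i) and 7)) and 1)
#define BIT_SET( p, i )       (p)[(i) shr 3] or= (1 shl ((i) and 7))
#define BIT_RESET( p, i )     (p)[(i) shr 3] and= not (1 shl ((i) and 7))

'' usage
dim as ubyte ptr flags = callocate( BITARRAY_BYTES( 1000 ) )
BIT_SET( flags, 123 )
print BIT_GET( flags, 123 )    '' prints 1
BIT_RESET( flags, 123 )
deallocate( flags )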
TJF
Posts: 3546
Joined: Dec 06, 2009 22:27
Location: N47°, E15°
Contact:

Postby TJF » May 21, 2011 8:09

1000101 wrote:btw, 1000 integers on a 64-bit system is only 8000 bytes.

1000 integers on a 64-bit system is 64,000 bytes.

1000101 wrote:None of the "restrictions" you claim are valid or true...
Then you didn't look very hard...
Simply false...
Depends on how it's used...

It seems that we're speaking about different issues. When I compare solutions using macros or OOP, I speak about solutions with similar flexibility and safety.

BTW:
1000101 wrote:My solution can handle up to the limits of the memory installed in the system and the limits of the system addressing.

Would you open the source, so that we can make up our own minds?
MichaelW
Posts: 3500
Joined: May 16, 2006 22:34
Location: USA

Postby MichaelW » May 21, 2011 9:14

TJF wrote:
1000101 wrote:btw, 1000 integers on a 64-bit system is only 8000 bytes.

1000 integers on a 64-bit system is 64,000 bytes.

How do you get 64,000 bytes from 1000 8-byte integers? Assuming that the array base is aligned, 1000 64-bit integers aligned on the natural boundaries will require 8000 bytes.
Galeon
Posts: 563
Joined: Apr 08, 2009 5:30
Location: Philippines
Contact:

Postby Galeon » May 21, 2011 12:03

He meant bits :).
