Most of the speed comes simply from moving 64 bytes at a time, but some nice improvements come from pre-empting the cache on read and then bypassing the cache on write.
Code: Select all
' scale2sync: 2x nearest-neighbour upscale of a 32bpp image straight onto the
' screen buffer using SSE2. Each inner iteration reads 64 bytes (16 source
' pixels), doubles them horizontally with shufps, and streams 128 bytes
' (32 pixels) to the screen with non-temporal stores; each source row is
' emitted twice for the vertical doubling.
'
' img: 32-bit FreeBASIC image whose width*2 / height*2 match the screen mode.
'
' NOTE(review): assumes w is a multiple of 16 pixels (shr eax, 4 discards any
'   remainder) and that both pxlData and screenptr are 16-byte aligned --
'   movdqa/movntdq fault on unaligned addresses. TODO confirm the runtime
'   guarantees this alignment for image and screen buffers.
sub scale2sync(img as uinteger ptr)
dim as uinteger ptr pxlData
dim as uinteger ptr scnptr
dim as integer w, h
imageinfo img,w,h,,,pxlData            ' fetch source dimensions and pixel data pointer
scnptr = screenptr
screenlock
asm
mov esi, [pxlData]                     ' esi = source pixel pointer
mov edi, [scnptr]                      ' edi = destination (screen) pointer
mov eax, [w]
mov ebx, [h]
mov edx, eax
shl edx, 2                             ' edx = source row pitch in bytes (w * 4)
shr eax, 4                             ' eax = inner-loop count (w / 16 pixels per pass)
shl ebx, 1                             ' ebx = output row counter (h * 2 rows)
row_copy:
mov ecx, eax
col_copy:
prefetchnta 64[esi]                    ' pull upcoming source data toward the CPU
prefetchnta 96[esi]                    '   without polluting the cache hierarchy
movdqa xmm0, 0[esi]                    ' xmm0 = pixels A B C D
movaps xmm1, xmm0
shufps xmm0, xmm0, &b01010000          ' xmm0 = A A B B (horizontal 2x of low half)
shufps xmm1, xmm1, &b11111010          ' xmm1 = C C D D (horizontal 2x of high half)
movdqa xmm2, 16[esi]                   ' same doubling for pixels 4..7
movaps xmm3, xmm2
shufps xmm2, xmm2, &b01010000
shufps xmm3, xmm3, &b11111010
movdqa xmm4, 32[esi]                   ' pixels 8..11
movaps xmm5, xmm4
shufps xmm4, xmm4, &b01010000
shufps xmm5, xmm5, &b11111010
movdqa xmm6, 48[esi]                   ' pixels 12..15
movaps xmm7, xmm6
shufps xmm6, xmm6, &b01010000
shufps xmm7, xmm7, &b11111010
movntdq 0[edi], xmm0                   ' stream 128 bytes to the screen,
movntdq 16[edi], xmm1                  '   bypassing the cache on write
movntdq 32[edi], xmm2
movntdq 48[edi], xmm3
movntdq 64[edi], xmm4
movntdq 80[edi], xmm5
movntdq 96[edi], xmm6
movntdq 112[edi], xmm7
add esi, 64                            ' consumed 16 source pixels
add edi, 128                           ' wrote 32 destination pixels
dec ecx
jnz col_copy
test ebx, 1                            ' on the even pass of each row pair,
jnz no_reset_row                       '   rewind esi so the same source row
sub esi, edx                           '   is emitted a second time
no_reset_row:
dec ebx
jnz row_copy
sfence                                 ' drain/order the weakly-ordered movntdq
                                       '   stores before the buffer is released
end asm
screenunlock
end sub