remove duplicates intries in an array

Post your FreeBASIC tips and tricks here. Please don’t post your code without including an explanation.
MrSwiss
Posts: 3829
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: remove duplicates intries in an array

Postby MrSwiss » May 03, 2021 17:17

Since strings are distinctly different from numeric/boolean data-types ...
They should therefore be treated somewhat differently too.
Because it should work correctly independent of "casing" (lower-/upper-/mixed-case),
we explicitly compare equally cased strings.
Below code is explicitly string specific and caters for two scenarios:
  • default: mixed lengths of strings (in array), mostly expected.
  • otherwise: equal lengths of strings expected/known (depends on input, e.g. *.csv file)
In default mode a two step approach is used (for speed). The method is called: "quit early".
(NO chronological order of the elements attempted.):

Code: Select all

Sub MakeUnique( _                       ' case insensitive check
          a()   As String,        _
    ByVal chlen As Boolean = TRUE _     ' default: do len check first!
    )
    Var i = LBound(a), j = 0, _
        k = UBound(a)

    If chlen Then                       ' faster if: mixed string length
        While i < k                     ' only if both string lengths are equal _
            j = i + 1                   ' we do a full string comparison, other- _
            While j <= k                ' wise: quit early
                If Len(a(i)) = Len(a(j)) Then
                    If LCase(a(i)) = LCase(a(j)) Then
                        a(j) = a(k)
                        k -= 1
                    Else
                        j += 1
                    End If
                Else
                    j += 1
                End If
            Wend
            i += 1
        Wend
    Else                                ' faster if: strings are mostly/always _
        While i < k                     ' of the same length
            j = i + 1
            While j <= k
                If LCase(a(i)) = LCase(a(j)) Then
                    a(j) = a(k)
                    k -= 1
                Else
                    j += 1
                End If
            Wend
            i += 1
        Wend
    End If
    Redim Preserve a(LBound(a) To k)    ' remove 'empties'
End Sub
Sorting should be dealt with, in a dedicated procedure (IMO).
aloberoger
Posts: 502
Joined: Jan 13, 2009 19:23

Re: remove duplicates intries in an array

Postby aloberoger » May 03, 2021 18:15

ok it's good now, i will test the speed later
MrSwiss
Posts: 3829
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: remove duplicates intries in an array

Postby MrSwiss » May 04, 2021 11:58

aloberoger wrote:ok it's good now, i will test the speed later
Just remember that, at the top of the priority-list is: QUALITY, before anything else (including speed).

All the speed in the world is useless if the code produces nonsense even if only sporadically.
(Those are typically the worst bugs to find and thereafter fix.)
counting_pine
Site Admin
Posts: 6287
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: remove duplicates intries in an array

Postby counting_pine » May 04, 2021 12:46

I wonder if skipping the length check will save enough time - compared to the lower case conversions and comparison - to be worth having an optimised routine for.
I think a better optimisation would be to use a case-insensitive comparison routine, even if you have to implement it yourself. This saves having to copy/convert both strings, and makes a preliminary length check unnecessary.

Code: Select all

cat /tmp/equal.bas
function equal_lcase(byref s1 as const string, byref s2 as const string) as boolean
  if len(s1) <> len(s2) then return false
  for i as integer = 0 to len(s1)-1
    if s1[i] = s2[i] then
      '' same ASCII value
    elseif ((s1[i] xor s2[i]) and not 32) = 0 then
      if cunsg((s1[i] and not 32) - asc("A")) < 26 then
        '' same letter, different case
      else
        '' not a letter
        return false
      end if
    else
      '' not equal
      return false
    end if
  next i
  return true
end function

assert(equal_lcase("", ""))
assert(equal_lcase("ABCdef", "abcDEF"))
assert(equal_lcase("Hello", "hello"))

assert(not equal_lcase("123", "1234"))
assert(not equal_lcase("A", "B"))
assert(not equal_lcase("A", "b"))

for i as integer = 0 to 255
  assert(equal_lcase(chr(i), chr(i)))

  select case (i and not 32)
  case asc("A") to asc("Z")
    assert(equal_lcase(chr(i), chr(i xor 32)))
  case else
    assert(not equal_lcase(chr(i), chr(i xor 32)))
  end select
next i
MrSwiss
Posts: 3829
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: remove duplicates intries in an array

Postby MrSwiss » May 04, 2021 13:34

counting_pine wrote:I wonder if skipping the length check will save enough time
The skipping of the len-check is exclusively for: "known to be equal len" strings (as from a formated *.csv or similar).
NOT for the typical string array containing strings of arbitrary length.

The lenght check is "optimisation" because it skips a lot of those string conversions/comparisons ...
Probably in the region of reducing them (in a typicall string-array) to <= 10%, skipping >= 90%.
(Just to explain the logic I'm employing.)

You seem to go far deeper into details here. I'll look into that later too.
counting_pine
Site Admin
Posts: 6287
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: remove duplicates intries in an array

Postby counting_pine » May 04, 2021 14:00

Sorry, I phrased my suggestion badly, and made it sound like I meant the opposite of what I intended.
I agree, the length check is a great time saver. I also suggest that its cost is negligible enough that it’s not worth duplicating most of the code just for the purpose of skipping it.
MrSwiss
Posts: 3829
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: remove duplicates intries in an array

Postby MrSwiss » May 07, 2021 18:14

@counting_pine, great idea/job ... thank you.

I've finally dug into your code, optimized it to the last CPU clock-tick and inserted it into MakeUnique():
  • offhand speed gain: roughly a factor 10 (not timed, could be more)
  • test-file size: about 400 KB (type: *.csv)
  • twice tested:
      - first: splitted into lines (equal string length, about 4'300 lines)
      - second: splitted into words (different string length, much shorter, roughly 48'000 words)
The code now:

Code: Select all

' was: Function equal_lcase() -- original code by counting_pine
' renamed and recoded (for speed) -- MrSwiss
Function Equal_XCase( _                 ' case insensitive string comp.
    ByRef s1    As Const String, _
    ByRef s2    As Const String  _
    ) As Boolean
    If Len(s1) <> Len(s2) Then Return FALSE

    For i As UInteger = 0 to Len(s1) - 1
        If s1[i] = s2[i] Then
            ' same ASCII value
        ElseIf ((s1[i] Xor s2[i]) And -33) = 0 Then
            If CUnsg((s1[i] And -33) - 65) < 26 Then
                ' same letter, different case
            Else
                ' not a letter
                Return FALSE
            End If
        Else
            ' not equal
            Return FALSE
        End If
    Next
    ' all checks passed = are equal (except: case may differ)
    Return TRUE
End Function

' string treatment differs from numeric data-types ...
Private Sub MakeUnique OverLoad( _      ' case insensitive check
          a()   As String _
    )
    Var i = LBound(a), j = 0, _
        k = UBound(a)

    While i < k
        j = i + 1
        While j <= k
            If Equal_XCase(a(i), a(j)) Then ' string equality check _
                a(j) = a(k) : k -= 1    ' case insensitive comparison
            Else
                j += 1
            End If
        Wend
       
        i += 1
    Wend
    Redim Preserve a(LBound(a) To k)    ' remove 'empties'
End Sub
counting_pine
Site Admin
Posts: 6287
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: remove duplicates intries in an array

Postby counting_pine » May 08, 2021 14:14

Another potential speedup, although slightly unintuitive, would be to use Swap instead of assignments to reorder the array.
Swapping string descriptors leaves the string data unchanged, and saves having to create or destroy new strings.

In conjunction with the string comparison function, that should mean that no string allocations at all are required inside the routine.
MrSwiss
Posts: 3829
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: remove duplicates intries in an array

Postby MrSwiss » May 08, 2021 16:37

I've tested your suggestion using: Swap ...
This time I've implemented timing, expecting a smallish difference only. No optimisations used.

Result: the difference (if any) varies more in-between runs, than in-between implementations ...
Leading to the conclusion: Swap doesn't influence the result in a measurable way.
(this may in a different context be very much different)

This proves clearly, that proper testing always beats (possibly erroneous) assumptions.
fxm
Moderator
Posts: 10375
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: remove duplicates intries in an array

Postby fxm » May 08, 2021 18:18

It might be necessary to test this with many array elements containing long strings to see the difference.
counting_pine
Site Admin
Posts: 6287
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: remove duplicates intries in an array

Postby counting_pine » May 09, 2021 12:44

I guess if there are O(n^2) comparisons and O(n) assignments, then the difference is likely to be negligible in many cases, except where most strings are long (comparable in size to the array length), and “obviously” different (i.e. different lengths or prefixes).

Return to “Tips and Tricks”

Who is online

Users browsing this forum: No registered users and 2 guests