remove duplicates intries in an array

Post your FreeBASIC source, examples, tips and tricks here. Please don’t post code without including an explanation.
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: remove duplicates intries in an array

Post by MrSwiss »

Since strings are distinctly different from numeric/boolean data-types ...
They should therefore be treated somewhat differently too.
Because it should work correctly independent of "casing" (lower-/upper-/mixed-case),
we explicitly compare equally cased strings.
Below code is explicitly string specific and caters for two scenarios:
  • default: mixed lengths of strings (in array), mostly expected.
  • otherwise: equal lengths of strings expected/known (depends on input, e.g. *.csv file)
In default mode a two step approach is used (for speed). The method is called: "quit early".
(NO chronological order of the elements attempted.):

Code: Select all

Sub MakeUnique( _                       ' case insensitive check
          a()   As String,        _
    ByVal chlen As Boolean = TRUE _     ' default: do len check first!
    )
    Var i = LBound(a), j = 0, _
        k = UBound(a)

    If chlen Then                       ' faster if: mixed string length
        While i < k                     ' only if both string lengths are equal _
            j = i + 1                   ' we do a full string comparison, other- _
            While j <= k                ' wise: quit early
                If Len(a(i)) = Len(a(j)) Then
                    If LCase(a(i)) = LCase(a(j)) Then
                        a(j) = a(k)
                        k -= 1
                    Else
                        j += 1
                    End If
                Else
                    j += 1
                End If
            Wend
            i += 1
        Wend
    Else                                ' faster if: strings are mostly/always _
        While i < k                     ' of the same length
            j = i + 1
            While j <= k
                If LCase(a(i)) = LCase(a(j)) Then
                    a(j) = a(k)
                    k -= 1
                Else
                    j += 1
                End If
            Wend
            i += 1
        Wend
    End If
    Redim Preserve a(LBound(a) To k)    ' remove 'empties'
End Sub
Sorting should be dealt with, in a dedicated procedure (IMO).
aloberoger
Posts: 507
Joined: Jan 13, 2009 19:23

Re: remove duplicates intries in an array

Post by aloberoger »

ok it's good now, i will test the speed later
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: remove duplicates intries in an array

Post by MrSwiss »

aloberoger wrote:ok it's good now, i will test the speed later
Just remember that, at the top of the priority-list is: QUALITY, before anything else (including speed).

All the speed in the world is useless if the code produces nonsense even if only sporadically.
(Those are typically the worst bugs to find and thereafter fix.)
counting_pine
Site Admin
Posts: 6323
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: remove duplicates intries in an array

Post by counting_pine »

I wonder if skipping the length check will save enough time - compared to the lower case conversions and comparison - to be worth having an optimised routine for.
I think a better optimisation would be to use a case-insensitive comparison routine, even if you have to implement it yourself. This saves having to copy/convert both strings, and makes a preliminary length check unnecessary.

Code: Select all

cat /tmp/equal.bas
function equal_lcase(byref s1 as const string, byref s2 as const string) as boolean
  if len(s1) <> len(s2) then return false
  for i as integer = 0 to len(s1)-1
    if s1[i] = s2[i] then
      '' same ASCII value
    elseif ((s1[i] xor s2[i]) and not 32) = 0 then
      if cunsg((s1[i] and not 32) - asc("A")) < 26 then
        '' same letter, different case
      else
        '' not a letter
        return false
      end if
    else
      '' not equal
      return false
    end if
  next i
  return true
end function

assert(equal_lcase("", ""))
assert(equal_lcase("ABCdef", "abcDEF"))
assert(equal_lcase("Hello", "hello"))

assert(not equal_lcase("123", "1234"))
assert(not equal_lcase("A", "B"))
assert(not equal_lcase("A", "b"))

for i as integer = 0 to 255
  assert(equal_lcase(chr(i), chr(i)))

  select case (i and not 32)
  case asc("A") to asc("Z")
    assert(equal_lcase(chr(i), chr(i xor 32)))
  case else
    assert(not equal_lcase(chr(i), chr(i xor 32)))
  end select
next i
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: remove duplicates intries in an array

Post by MrSwiss »

counting_pine wrote:I wonder if skipping the length check will save enough time
The skipping of the len-check is exclusively for: "known to be equal len" strings (as from a formated *.csv or similar).
NOT for the typical string array containing strings of arbitrary length.

The lenght check is "optimisation" because it skips a lot of those string conversions/comparisons ...
Probably in the region of reducing them (in a typicall string-array) to <= 10%, skipping >= 90%.
(Just to explain the logic I'm employing.)

You seem to go far deeper into details here. I'll look into that later too.
counting_pine
Site Admin
Posts: 6323
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: remove duplicates intries in an array

Post by counting_pine »

Sorry, I phrased my suggestion badly, and made it sound like I meant the opposite of what I intended.
I agree, the length check is a great time saver. I also suggest that its cost is negligible enough that it’s not worth duplicating most of the code just for the purpose of skipping it.
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: remove duplicates intries in an array

Post by MrSwiss »

@counting_pine, great idea/job ... thank you.

I've finally dug into your code, optimized it to the last CPU clock-tick and inserted it into MakeUnique():
  • offhand speed gain: roughly a factor 10 (not timed, could be more)
  • test-file size: about 400 KB (type: *.csv)
  • twice tested:
    • - first: splitted into lines (equal string length, about 4'300 lines)
      - second: splitted into words (different string length, much shorter, roughly 48'000 words)
The code now:

Code: Select all

' was: Function equal_lcase() -- original code by counting_pine
' renamed and recoded (for speed) -- MrSwiss
Function Equal_XCase( _                 ' case insensitive string comp.
    ByRef s1    As Const String, _
    ByRef s2    As Const String  _
    ) As Boolean
    If Len(s1) <> Len(s2) Then Return FALSE

    For i As UInteger = 0 to Len(s1) - 1
        If s1[i] = s2[i] Then
            ' same ASCII value
        ElseIf ((s1[i] Xor s2[i]) And -33) = 0 Then
            If CUnsg((s1[i] And -33) - 65) < 26 Then
                ' same letter, different case
            Else
                ' not a letter
                Return FALSE
            End If
        Else
            ' not equal
            Return FALSE
        End If
    Next
    ' all checks passed = are equal (except: case may differ)
    Return TRUE
End Function

' string treatment differs from numeric data-types ...
Private Sub MakeUnique OverLoad( _      ' case insensitive check
          a()   As String _
    )
    Var i = LBound(a), j = 0, _
        k = UBound(a)

    While i < k
        j = i + 1
        While j <= k
            If Equal_XCase(a(i), a(j)) Then ' string equality check _
                a(j) = a(k) : k -= 1    ' case insensitive comparison
            Else
                j += 1
            End If
        Wend
        
        i += 1
    Wend
    Redim Preserve a(LBound(a) To k)    ' remove 'empties'
End Sub
counting_pine
Site Admin
Posts: 6323
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: remove duplicates intries in an array

Post by counting_pine »

Another potential speedup, although slightly unintuitive, would be to use Swap instead of assignments to reorder the array.
Swapping string descriptors leaves the string data unchanged, and saves having to create or destroy new strings.

In conjunction with the string comparison function, that should mean that no string allocations at all are required inside the routine.
MrSwiss
Posts: 3910
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: remove duplicates intries in an array

Post by MrSwiss »

I've tested your suggestion using: Swap ...
This time I've implemented timing, expecting a smallish difference only. No optimisations used.

Result: the difference (if any) varies more in-between runs, than in-between implementations ...
Leading to the conclusion: Swap doesn't influence the result in a measurable way.
(this may in a different context be very much different)

This proves clearly, that proper testing always beats (possibly erroneous) assumptions.
fxm
Moderator
Posts: 12081
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: remove duplicates intries in an array

Post by fxm »

It might be necessary to test this with many array elements containing long strings to see the difference.
counting_pine
Site Admin
Posts: 6323
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: remove duplicates intries in an array

Post by counting_pine »

I guess if there are O(n^2) comparisons and O(n) assignments, then the difference is likely to be negligible in many cases, except where most strings are long (comparable in size to the array length), and “obviously” different (i.e. different lengths or prefixes).
Post Reply