Making compact number representation

Tourist Trap · Post by **Tourist Trap** » Jul 25, 2015 17:11

Hi,

I'm in need of storing a very long list of numbers themselves quite long, which would lead to a very big file (it's the numbers related to randomize seed first random value affair - if you were curious). I want those numbers compacted so, but I need also this list to keep readable as it. No way to just use file compression and I'm counting then on a change of the number representation.

* For instance 2^22 = 4194304 ---> readability perfect, and 3 digits saved.

But in an other hand : 128 = 2^1+2^2+2^3....+2^6. This is rather long and we are loosing(!) However, we may be able to choose a mixed approach where we use base-n representation if we get an advantage, and stick at the original else. So to be sure to save space or nothing.

Second point we can handle, the number notation. Everyone knows about hexadecimal for instance. They can help save space also since *15_dec = F_hex*, which means that we need only one digit to store two over a given range of numbers (0-15 here). So to take advantage of it I propose using a wider notation base, like base 62 that I use in the code that I'm currently writting for this affair - see below. Why base 62? just as a showcase because it is possible to change a decimal (from 0 to 61), for : a digit, a lowcase character or an uppercase character... (After all considere also that decimal is a compaction of binary)

* For instance 77 777 decimal gives in base62 notation KET. Great space saved again.

Combining the 2 advantages I'm expecting some gain... But I would like to know here if someone has heard of a better idea for such a task?
Else, what would be the best compact choice of base in general, either for representation as power or for the notation issue ?... I'll try to formulate it better.

There is a lot of choice of base for writting N =sum( a*base^b )
There is a lot of choice of base for number notation , octal , decimal, hexadecimal, base62
==> is there any best choice over all the possibilies?
==> is there any best choice over performance issue (to get the representation computed fast)

- Thanks a lot.

Here joined a first attempt to play with representations. Some clean-up required but should work as expected for demo purpose.

Code: Select all

'																			.---.
'																			| 0 |
'																			.---.
Const As ULongInt U_MAXULONGINT = 2^64
Const As ULongInt U_MAXLONGINT = 2^63
Const As ULongInt _MAXULONGINT = U_MAXULONGINT - 1
Const As LongInt _MAXLONGINT = U_MAXLONGINT-1
'																			.---.
'																			| 1 |
'																			.---.
Type BASEnCOMPACT
	Declare Constructor(ByVal BASE_ As UByte = 2, NUMBER_ As ULongInt = 0)
	
	Declare Property BSE() As UByte                     ''*
	Declare Property BSE(As UByte)                      ''*
	Declare Property NUMBER() As ULongInt
	Declare Property NUMBER(As ULongInt)
	Declare Property GETTEXT() As String
	Declare Property REPRESENTATION() As String
	Declare Property GETFACTOR(ByVal Index As UByte) As UByte
	
	Declare Function GreaterExp() As LongInt
	Declare Function BaseExp(ByRef Exp_ As LongInt, _ 
									 ByVal Num_ As ULongInt=1) As ULongInt
	Declare Function BaseRep(Rep_() As UByte, _ 
									 ByVal Num_ As ULongInt=1) As ULongInt
	Private :
	 As UByte _BASE = 2
	 As ULongInt _NUMBER = 0
 	 As UByte _REPRESENTATION(63)
	 As LongInt _EXPONENT = 0
	 As ULongInt _ADDVALUE = 0
	 As String _TEXT = "2E0+A" 
	 As LongInt Ptr _EXPONENTLIST
	 As ULongInt Ptr _FACTORLIST
End Type 'BASEnCOMPACT
Constructor BASEnCOMPACT(ByVal BASE_ As UByte = 2, ByVal NUMBER_ As ULongInt = 0)
	 This._BASE = BASE_
	 This._NUMBER = NUMBER_
	For _index As UByte = 0 To 63
		 This._REPRESENTATION(_index) = 0
	Next _index
	Select Case NUMBER_
		Case 0
			 This._EXPONENT = -1
			 This._ADDVALUE = 0
			 This._TEXT = "0"
		Case Else
			 This.BaseExp(This._EXPONENT, NUMBER_)
			 This._ADDVALUE = NUMBER_ - BASE_^This._EXPONENT
			 This._TEXT = This.GETTEXT
	End Select 'NUMBER_	 
End Constructor 'BASEnCOMPACT(UBYTE[=2],ULONGINT[=0])
Property BASEnCOMPACT.BSE() As UByte
	Return This._BASE
End Property 'Get UBYTE:=BASEnCOMPACT.BSE()
Property BASEnCOMPACT.BSE(ByVal setBASE As UByte)
	 This._BASE = setBASE
	 This.NUMBER = This._NUMBER
End Property 'Set BASEnCOMPACT.BSE()
Property BASEnCOMPACT.NUMBER() As ULongInt
	Return This._NUMBER
End Property 'Get ULONGINT:=BASEnCOMPACT.NUMBER()
Property BASEnCOMPACT.NUMBER(ByVal setNUMBER As ULongInt)
	 This._NUMBER = setNUMBER
	 ''Fill fields accordingly	 
	Select Case setNUMBER
		Case 0
			 This._EXPONENT = -1
			 This._ADDVALUE = 0
			 This._TEXT = "0"
		Case Else
			 This.BaseExp(This._EXPONENT, setNUMBER)
			 This._ADDVALUE = setNUMBER - This._BASE^This._EXPONENT
			 This._TEXT = This.GETTEXT
	End Select 'setNUMBER
End Property 'Set ULONGINT:=BASEnCOMPACT.NUMBER()
Property BASEnCOMPACT.GETTEXT() As String
	Dim As String S  
	 S = Str(This._BASE)
	 S &= "^"& Str(This._EXPONENT)
	 S &= IIf(This._ADDVALUE=0, "", "+"& This._ADDVALUE)
	Return S
End Property 'Get STRING:=BASEnCOMPACT.GETTEXT()
Property BASEnCOMPACT.REPRESENTATION() As String
	For _index As UByte = 0 To 63
		 This._REPRESENTATION(_index) = 0
	Next _index
	 This.BaseRep(This._REPRESENTATION(), This._NUMBER)
	Dim As String R = "b"& This._BASE
	 R &= "<+"
	For _index As UByte = 0 To 63
		If This._REPRESENTATION(_index)<>0 Then
			 R &= This._REPRESENTATION(_index) & "*" & "^" & _index & "+"  
		End If
	Next _index
	 R &= ">"
	Return R
End Property 'Get BASEnCOMPACT.REPRESENTATION()
Property BASEnCOMPACT.GETFACTOR(ByVal Index As Ubyte) As UByte
	Return This._REPRESENTATION(Index)
End Property '
Function BASEnCOMPACT.GreaterExp() As LongInt
	Dim As UByte index = 63
	 This.BaseRep(This._REPRESENTATION(), This._NUMBER)
	
	While Not (index<=0 OrElse This._REPRESENTATION(index))
		 index -= 1
	Wend 'index>0

	Return index
End Function 'LONGINT:=BASEnCOMPACT.GreaterExp()
Function BASEnCOMPACT.BaseExp(ByRef Exp_ As LongInt, _ 
										ByVal Num_ As ULongInt=1) As ULongInt
	Exp_ = 0
	While Num_>=This._BASE
			 Exp_ += 1
			 Num_ \= This._BASE
	Wend 'Num_>=This._BASE

	Return Num_ - This._BASE^Exp_
End Function 'ULONGINT:=BASEnCOMPACT.BaseExp(REF_UINTEGER,ULONGINT[=1])
Function BASEnCOMPACT.BaseRep(Rep_() As UByte, _ 
										ByVal Num_ As ULongInt=1) As ULongInt
	Dim As UByte _Exp = 0
	Dim As ULongInt _Num = Num_
	Dim As ULongInt _Rem = Num_
	
	For _index As UByte = 0 To 63
		 Rep_(_index) = 0
	Next _index
	
	While _Rem >= This._BASE
		 _Rem = _Num 
		 _Exp = 0
		While _Num >= This._BASE
			 _Exp += 1
			 _Num = _Num \ This._BASE
		Wend		
		If _Exp = 0 Then 
			 Rep_(_Exp) += _Rem 
		Else
			 Rep_(_Exp) += 1
		 	 _Num = _Rem - This._BASE^_Exp
		EndIf
	Wend

	Return 0
End Function 'ULONGINT:=BASEnCOMPACT.BaseRep(REF_UBYTE(),ULONGINT[=1])
'																			.---.
'																			| 2 |
'																			.---.
Function Base62ToNotation(ByVal Cpn As BASEnCOMPACT) As String
	Var initialBASE = Cpn.BSE
	If initialBASE<>62 Then 
		Cpn.BSE=62
	EndIf
	Dim As String f62 = ""
	For _index As Byte = Cpn.GreaterExp() To 0 Step -1
		Select Case As Const Cpn.GETFACTOR(_index)
			Case 0 To 9
				 f62 &= Chr(48+Cpn.GETFACTOR(_index))
			Case 10 To 23
				 f62 &= Chr(65+Cpn.GETFACTOR(_index)-10)
			Case 24
				 f62 &= Chr(64)
			Case 25 To 35
				 f62 &= Chr(65+Cpn.GETFACTOR(_index)-10)
			Case 36 To 61
				 f62 &= Chr(97+Cpn.GETFACTOR(_index)-36)
		End Select 'As Const Cpn.GETFACTOR(_index)		
	Next '_index
	Cpn.BSE = InitialBASE

	Return f62
End Function 'STRING:=Base62ToNotation(STRING)
'																			.---.
'																			| 3 |
'																			.---.
Function SeedMapping(byval Seed as ULongInt) As ULongInt
    Randomize Seed
    Return Rnd()*1e+17
End Function 'ULONGINT:=SeedMapping(ULONGINT)
'																			.---.
'																			| 4 |
'																			.---.
'------------------------------------------------------------
'                          DEMO
'------------------------------------------------------------
Dim As String NUM
Dim As BASEnCOMPACT Cpn = BASEnCOMPACT(2, 77777)

Do
	Cls
	Locate 1,1
	? "DECIMAL NUMBER       " ; Cpn.NUMBER;
	? "               REPRESENTATION BASE  " ; Cpn.BSE	
	? "--------------"; 
	? "                           -------------------"
	?
	? "REPRESENTATION 1     " ; Cpn.GETTEXT
	? "----------------"
	? 
	? "REPRESENTATION 2"
	? "----------------"
	? Cpn.REPRESENTATION
	?
	? "REPRESENTATION 3 (BASE 62)"
	? "--------------------------"
	? Base62ToNotation(Cpn)

	Locate 24, 10 : Input "New decimal number OR [Q] to quit :"; NUM
	Cpn.NUMBER = CUlngInt(NUM)
Loop Until LCase (NUM) = "q" 
'------------------------------------------------------------

MichaelW · Post by **MichaelW** » Jul 26, 2015 4:54

The CRT _itoa,_i64toa,...functions are reliable and reasonably speed optimized, and can encode values in base 36.

Code: Select all

#include "crt.bi"
dim as zstring * 100 buffer
_itoa(1234567890,@buffer,36)
print buffer
sleep

Code: Select all

kf12oi

caseih · Post by **caseih** » Jul 26, 2015 5:12

If you're going to store numbers in alternate bases using ascii letters as well as numbers as digits, then the resulting text isn't really human readable anyway. So why not just store them in binary? That's the most compact form you can get.

jevans4949 · Post by **jevans4949** » Jul 26, 2015 7:40

You may be interested in the variable-length integer format used by MIDI files. This is a big-endian format, with 7 bits of data and the senior bit set on if there are more bytes to follow. The following decodes one to an integer.

Code: Select all

'***********************************************************************
' Extract midi variable-length integer, advancing source pointer
'***********************************************************************
function varint (byref xptr as ubyte ptr) as integer
    dim xval as integer = 0
    dim xsw as integer = 128
    while xsw = 128 
        xval = (xval shl 7) or ((*xptr) and 127)
        xsw = ((*xptr) and 128)
        xptr+=1
    wend
    function = xval
end function

Disaadvantages: you waste one bit per byte, and to access the 99th number you would have to scan the preceding 98. And it's not actually human-readable, although you could use a hex viewer, or write your own file viewer.

Tourist Trap · Post by **Tourist Trap** » Jul 26, 2015 13:53

MichaelW wrote:The CRT _itoa,_i64toa,...functions are reliable and reasonably speed optimized, and can encode values in base 36.

Nice thanks. I'll use this.
It can encode pretty well up to further than 36. I dont really understand what characters are used when ascii are exhausted, but this: _itoa(1234567890,@buffer,1234567890) return 10.

caseih wrote:If you're going to store numbers in alternate bases using ascii letters as well as numbers as digits, then the resulting text isn't really human readable anyway. So why not just store them in binary? That's the most compact form you can get.

Using alphanumerical allows a relative readability in a list, since it keeps track of patterns. For instance if the numbers are a series incremented by +1, you'll read that very easiliy. Also for periodicity. And for numbers up to 16 is like hexadecimal so it should be manageable. I've been using yesterday a representation based on the 222 visible ascii characters so not far to the 255 which is the binary if I'm not wrong? But there is not a big digit reduction from 222 to 255, or even from 62 to 255.

So for the moment I tend to think that notation62 is pretty optimal. I wonder also what would be to use unicode but I'm far to be at ease with this.

jevans4949 wrote:You may be interested in the variable-length integer format used by MIDI files.
(...)
And it's not actually human-readable, although you could use a hex viewer, or write your own file viewer.

I will try to dig out around this. It may help for a first step in file storage, before a further compression. But for the display, I'm projecting to put the list on a web page so some readability would be mandatory, at least displayability.

Thanks all.

srvaldez · Post by **srvaldez** » Jul 26, 2015 15:10

MichaelW wrote:The CRT _itoa,_i64toa,...functions are reliable and reasonably speed optimized, and can encode values in base 36.

there doesn't seem to be a reverse-function available, that is, there's no atoi function with arbitrary radix option.

RockTheSchock · Post by **RockTheSchock** » Jul 26, 2015 19:08

I think the most compact format is if you store the numbers binary without any type of compression. You have randomly generated relativ evenly distributed numbers, so any lossless compression isn't very effective. Well if you bloat them by choosing human readable format to then compact them again it wont be as compact as pure binary format, unless there is very big bug in the number generator.

What do you want to accomplish in the end? A web interface to filter / sort / show numbers in statistical analysation? There is no point in storing so many numbers in a file in a human readable format. To get a meaning out of it there must be some sort of program. For the program it doesnt matter which format the numbers are stored in.

Tourist Trap · Post by **Tourist Trap** » Jul 26, 2015 21:43

srvaldez wrote:There doesn't seem to be a reverse-function available, that is, there's no atoi function with arbitrary radix option.

Surprising. However if the way it is done is clear, like say number n ---->mapped to---> chr(n) for instance, it wouldn't be a problem to write the inverse function.

RockTheSchock wrote: There is no point in storing so many numbers in a file in a human readable format.

By readable I mean number written differently, yet not too tricky. Change of base, exponential notation, baseN notation, this all matches. What wont be at all interesting would be an overall compression scheme. Of course with the correct interface that would unzip on the fly what has to be displayed to the screen, that would be great but would require a deep understanding of zipped content, which, as you can imagine I haven't at all.

RockTheSchock wrote:You have randomly generated relativ evenly distributed numbers, so any lossless compression isn't very effective.

Here is one row sample of the data : U=9223372036853992959 ; R1= 0.6552179658319801. Worth noticing that _itoa36 change those for : U = 1z0n9xb ; R1 = epse61 (for R1*1e+18).

U stands over ULongInt so when you think that 2^60 for instance is one of them, it seems not silly to also trade the decimal form for the exponential one when it beats _itoa.

For R1, the datas are random but only by pack of 650 if I've understood the Mersenne algorithm. Moreover I'm projecting to multilply by 1e+18 those double comprised between 0 and 1. That's most often very long numbers where some compact representation can help.

You have so the keys to understand what I'm doing. I have a readability limit, but if pushed too far, this won't be so much readable of course. And for the rest, if I can divide by 2 the overall amount of char to store I would find this not so bad. Now i'll have to make some tests to say more.

RockTheSchock wrote:What do you want to accomplish in the end?

Just be able to get the whole table of inverses of U --->R1.

RockTheSchock · Post by **RockTheSchock** » Jul 27, 2015 6:42

Tourist Trap wrote:You have so the keys to understand what I'm doing.
RockTheSchock wrote:What do you want to accomplish in the end?
Just be able to get the whole table of inverses of U --->R1.

Thats a very specific problem you want to solve. But it doesnt help me to understand the big picture. You have a file with millions or billions of numbers. What shall I do with such a file. Should i open it in a textviewer and look at the numbers? Maybe I search for some numbers...... After some minutes my text programm eventually finds an entry or ... just crashes.

Tourist Trap wrote:What wont be at all interesting would be an overall compression scheme. Of course with the correct interface that would unzip on the fly what has to be displayed to the screen, that would be great but would require a deep understanding of zipped content, which, as you can imagine I haven't at all.

Like you would write an on-the-fly unzip routine, you could just convert from binary to human readable format for display.

If you want to have a very compact solution you just safe the seed and the index to the number in the sequence. Generate the number on the fly. By using Randomize with algorithm 3 ( Mersenne Twister), which should be stable, you can generate the exact same sequence from stored seed without any problems.

dodicat · Post by **dodicat** » Jul 27, 2015 13:19

Here is a thing I did years ago, on the fly compression/expansion of whole numbers.
In this case ulongintegers.

Code: Select all

 

function Intrange(L as ulongint,U as ulongint) as ulongint
    return (Rnd*((U+1)-(L))+(L))
end function



Dim Shared As String setkey(0 To 99)

Function SAR(s0 As String,s1 As String,s2 As String) As String'sar ~~ search and replace
    Dim s As String=s0
   var position=Instr(s,s1)
    While position>0
        s=Mid(s,1,position-1) & s2 & Mid(s,position+Len(s1))
        position=Instr(position+Len(s2),s,s1)
    Wend
    SAR=s
End Function

Function compress(Byref num1 As String,key As Integer=27) As String
    key=key+129
    If key<0 Then key=0
    If key>156 Then key=156
    Dim As String num2=num1
    For x As Integer=0 To 99:setkey(x)=Chr(x+key):Next
        Dim As Integer s1=99
        Do 
            num2=SAR(num2,Str(s1),setkey(s1))
            s1=s1-1
        Loop Until s1=-1
        Return num2
    End Function
    
    Function expand(Byref num1 As String,key As Integer=27) As String 
        key=key+129
        If key<0 Then key=0
        If key>156 Then key=156
        For x As Integer=0 To 99:setkey(x)=Chr(x+key):Next
            Dim As String num2=num1
            Dim As Integer s1=99
            Do 
                num2=SAR(num2,setkey(s1),Str(s1))
                s1=s1-1
            Loop Until s1=-1
            Return num2
        End Function
        
  '===========================================================      
        dim as string C,E
        dim as ulongint U
        do
       U= intrange(1,18446744073709551615)'get a random ulongint
       print "Ulongint";tab(30);u
       C= compress(str(u))
       print "Compressed";tab(30); C
       E=expand(C)
       print "Expanded";tab(30); E
       print
       print
       sleep
   loop until inkey=chr(27)

Tourist Trap · Post by **Tourist Trap** » Jul 27, 2015 15:35

dodicat wrote:Here is a thing I did years ago, on the fly compression/expansion of whole numbers.
In this case ulongintegers.

Thanks dodicat.

RockTheSchock wrote:Thats a very specific problem you want to solve. But it doesnt help me to understand the big picture. You have a file with millions or billions of numbers. What shall I do with such a file. Should i open it in a textviewer and look at the numbers?

Sorry, it was some obvious the way i would use the list that I've omitted details. So this is the big picture :

step1 - program maps a ulongint U to a result R
step2 - program change number representation in a predictible and reversable way
step3 - program makes a file on disk, not for the whole list but say it writes 1e+4 lines each time
step4 - i add this to a web page, clean up my disk and reiterate step3

Once done with this, the way to search a number is not difficult. I take a number to search, say S, and apply to it the representation change (predictible as said before). It will then search the compacted item in the list. Then I get its brother item which is still compacted and I reverse the representation (it's reversable). That's all.

I've not really made this strategy for the whole list as one whole set, it will probably have to get split. But I would like the set to be as long as possible.

What doesn't matter is not to get the whole list in one row because to generate it with my old computer is so long that I think I have the time to deal with the performance issue later...

The code I've written first was this one :

Code: Select all

Dim As String fileFullName = "D:/Temp/mersenne.txt"
Dim As Integer f = FreeFile

Const  As ULongInt MaxPivot = 9223372036854775295
Dim As Double R1, R2
Dim As ULongInt startblocknum, endblocknum, u, N = 1
ReDim As ULongInt Pivot(1 To N)
ReDim As Double Value(1 To N)
Dim As Double startTime, endTime
    Randomize MaxPivot+1
    R1 = Rnd()

      endblocknum = 1
      startblocknum = endblocknum-1

Do
        startTime = Timer
    
      Open fileFullName For Output As f
      Print #f, "startTime="; startTime
      Print #f, "endblocknum="; endblocknum
    
    For u = MaxPivot-startblocknum*1e+6 To MaxPivot-endblocknum*1e+6 Step -1
         Randomize u
         R2 = Rnd()
        If Not(R1=R2) Then
                 N += 1
                 ReDim Preserve Pivot(1 To N)
                  ReDim Preserve Value(1 To N)
                 Pivot(N) = u
                 Value(N) = R1
                 Print #f, "N="; N; " U=";U
                 Print #f, "R1="; R1
                 '? "N "; N, " U ";U
                 '? "  ", "    R1 "; R1
        EndIf
         R1 = R2
    Next u
    
endTime = Timer - startTime
Print #f, "endTime="; endTime
Close #f

? "N "; N, " U ";U
? "  ", "    R1 "; R1

? "Press a key to launch next row n°"; endblocknum
      Sleep
      endblocknum += 1
      startblocknum = endblocknum-1
Loop Until InKey = Chr(27) Or u=0

? "END"

Sleep : End 0

dodicat · Post by **dodicat** » Jul 27, 2015 16:04

srvaldez wrote:
MichaelW wrote:The CRT _itoa,_i64toa,...functions are reliable and reasonably speed optimized, and can encode values in base 36.
there doesn't seem to be a reverse-function available, that is, there's no atoi function with arbitrary radix option.

Hi srvaldez
I found this reverse function.
Question:
Why are these C functions not mentioned in the CHM help file?
Base 32 works, not 64.
So the numbers are restricted to long .

Code: Select all

#include "crt.bi"
dim as zstring * 100 buffer

function Intrange(L as long,U as long) as long
    return (Rnd*((U+1)-(L))+(L))
end function


dim as zstring ptr dummy
dim as long number
do
    number=IntRange(1,2147483647)
    print "Number ";number
    _itoa(number,@buffer,32)
    print "Base 32 "; buffer
dim as long ans=strtol(buffer,@dummy,32)
print  "Number ";ans
print
print
sleep
loop until inkey=chr(27)
    
sleep

srvaldez · Post by **srvaldez** » Jul 27, 2015 16:27

Thank you dodicat :)

dodicat · Post by **dodicat** » Jul 27, 2015 19:30

You don't need the dummy zstring ptr, just use 0
...
dim as long ans=strtol(buffer,0,32)
...

36 is the highest base here.
Of course my msvcrt.dll is probably outdated (Win XP), and I am still on 32 bit OS.

Making compact number representation

Making compact number representation

Re: Making compact number representation

Re: Making compact number representation

Re: Making compact number representation

Re: Making compact number representation

Re: Making compact number representation

Re: Making compact number representation

Re: Making compact number representation

Re: Making compact number representation

Re: Making compact number representation

Re: Making compact number representation

Re: Making compact number representation

Re: Making compact number representation

Re: Making compact number representation