DEV Community

gyudoza
gyudoza

Posted on

Search safe number compression algorithm(feat. Python)

There was some happening.

A particular value interpreted in a certain data had to be stored in elasticsearch and taken through a search, but the particular value was a number with a length of more than 300 characters.
So, I tried to implement the search through fuzzy or more_like_this, but the search was not good because it was too long. I thought it would be much better to compress and store this value if I were to use it only for search purposes without rewriting it anyway.

I simply left 0 to 9 and approached it with ASCII as if compressing numbers from 10 using alphabets or other numbers. People who have used search engines such as Google may know that search engines have special processing characters (such as ", space, ', etc.), so i created search-safe compression algorithms without using that special chars.

import random


chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_'
decoder = {}
encoder = {}
number = 10
for char in chars:
    str_number = str(number)
    decoder[char] = str_number
    encoder[str_number] = char
    number += 1


def create_random_number_string(length: int):
    return ''.join(str(random.randrange(0, 10)) for _ in range(length))


def compress(number_string: str):
    for number in encoder:
        number_string = number_string.replace(number, encoder[number])
    return number_string


def extract(compressed: str):
    for char in decoder:
        compressed = compressed.replace(char, decoder[char])
    return compressed


for _ in range(5):
    random_length = random.randrange(150, 200)
    test_number_string = create_random_number_string(random_length)
    compressed = compress(test_number_string)
    extracted = extract(compressed)
    compressed_rate = ((len(test_number_string) - len(compressed)) / len(test_number_string))
    print(len(test_number_string), ':', test_number_string)
    print(len(compressed), ':', compressed)
    print(len(extracted), ':', extracted)
    print('%0.2f%s' % (compressed_rate * 100, '%'))
    print("=======================================")
    print()

Enter fullscreen mode Exit fullscreen mode

For 5 test cases

173 : 38217891529214746234295363014799197479060934296524329944079434354710188378778646041457672365438358060229029172126710046843892063571710775285494291604692579406512763592800742
118 : C2h89ft2e7Kn4t5Aue799j7L90Y9yt65o3t94E794yzLai8B87786K04eV67n6SCz80Ym90th2c67a0K84C9k6z7ha775s5N4tg0K9p79E65c76z9s007G
173 : 38217891529214746234295363014799197479060934296524329944079434354710188378778646041457672365438358060229029172126710046843892063571710775285494291604692579406512763592800742
31.79%
=======================================

175 : 7639716822342277816032528043792097211529552642763043314807290564221988329254460181232184798255115898904013545953569374090168012001578589355676046306782315697282643001276497663
119 : 76D7g8mym778g03ps04B9k972b5tTq4r6u4xe807t0U4mj883tpIYicwiL98p5bW9890Ed5J95z69BE90g80c00f78W9zU7YKu678nf697sq4u0c76N7663
175 : 7639716822342277816032528043792097211529552642763043314807290564221988329254460181232184798255115898904013545953569374090168012001578589355676046306782315697282643001276497663
32.00%
=======================================

169 : 7493139163694318425430833685052034163810084088304567519918460284909110852590873676525955716306253186813604736222677625163963843917395318561832048231972429826794656963474
110 : 7N3d9gA9Hi4p4u8x68O5kygCa08E88uJ675j9iK0sN091a85p9087A765p9T7gu6p3i68dYLAmq776pgD6C4DhDRiUi3kMnj7ot8q79KU96y74
169 : 7493139163694318425430833685052034163810084088304567519918460284909110852590873676525955716306253186813604736222677625163963843917395318561832048231972429826794656963474
34.91%
=======================================

164 : 12026339925028632674921994888837232115038090276115065754862563107795134822572088600680031246095798421381627915938027997533115145310302719886551641616281934706968564
111 : c0qx99p0s63q7N2j9M888Bn2bOC090r6bO6V5M6p63a7795dMmVk88Y068003cK09V98Gd8gr9f9C0r9975xb5eRaurj886Tg4ggsjy706968U4
164 : 12026339925028632674921994888837232115038090276115065754862563107795134822572088600680031246095798421381627915938027997533115145310302719886551641616281934706968564
32.32%
=======================================

176 : 73946801840514332235134053456152477400743521396323428813867397457955677468949406618792480382695298743841357684557186163646921605264305170043119348188501288987043495676184196146
121 : 7DK80iE5exmzdE5yUfo77E074z2d963n4s8d867D7J79T677K89NE66i79o80Cq95t874C4dV68JVi6gAK92g05q4u5h00Hb9y8i8Oc8898704y9U76i4j6e6
176 : 73946801840514332235134053456152477400743521396323428813867397457955677468949406618792480382695298743841357684557186163646921605264305170043119348188501288987043495676184196146
31.25%
=======================================
Enter fullscreen mode Exit fullscreen mode

You can check the compression rate of slightly more than 30% on average.

Running example

It has an average compression rate of 32.35 percent in 10,000 test cases, so it seems to be somewhat useful. When used on a particular platform, it is easy to increase the compression rate by adding the search_safe character on that platform to chars.

Of course, let's not make the mistake of adding multi-byte characters like Korean.

Discussion (0)