DEV Community

Cover image for Implementing a simple hand coded scanner
IWasHiding
IWasHiding

Posted on • Updated on

Implementing a simple hand coded scanner

I want to start this post by quoting what Imam Al-Shafi’i الامام الشافعي said about knowledge and learning

العـلــم صيــد والكتابة قيــده ،،، قيّد صيودك بالحبال الواثقة
فمن الحماقة أن تصيد غزالة ،،، وتتركها بين الخلائق طالقه
Enter fullscreen mode Exit fullscreen mode

If we want to explain the meaning of this verse in English, and you should know, dear reader, that what I'm going to write is not a translation of the two verses, but rather a simplified explanation in English.

So Imam Al-Shafi’i here is trying to explain to us that learning and knowledge is like a hunting a gazelle, If we catch a gazelle, we must tie it up and not let it run as it pleases, otherwise this will not be considered hunting.

Science, knowledge, or learning is like a deer, and to tie it up, we must write it down, and that's why I'm writing this post, because i learned something new which is hand coded scanner i also wrote this scanner by writing tests first, so I'm also practicing Test Driven Development, because I'm planning to follow this principle in my next projects.

Although what I will do is just small potion of code, it is like a just enough code to understand some concepts about hand coded scanners, you can complete what i did or give some feedbacks too.

Our objective here is to write a hand coded scanner for the following DFA.

DFA for our hand coded scanner

This DFA is from Engineering a compiler by Keith Cooper Linda Torczon, on page 98.

So let's start with writing tests first, we will need to test the initialization part, check the code below please.

#########################################################################
# test a hand coded a scanner for the following RES "abc|bca|cab"       #
#########################################################################

import unittest
from hand_coded_scanner import Scanner

FILE_MOCK = "azeaze,azzqbz,nnnnn,abc,accc," \
            "abababa,bca,cab"

class HandCodedScannerTest(unittest.TestCase):
   def setUp(self):
      self.scanner = Scanner(2)

   def test_init_scanner(self):
      self.assertEqual(self.scanner.buffer_limit, 2)
      self.assertEqual(self.scanner.current_state, None)
      self.assertEqual(self.scanner.buffer, '')
      self.assertEqual(self.scanner.char, '')
      self.assertEqual(self.scanner.current_input, 0)
      self.assertEqual(self.scanner.current_buffer_input, 0)
      self.assertEqual(self.scanner.file_content, '')

Enter fullscreen mode Exit fullscreen mode

As you can see in the first line we are using the famous python testing library unittest, and then we imported the Scanner class from a file named hand_coded_scanner, both of them does not exists yet.

Now if we try to run python test_hand_coded_scanner.py we will simply get the following error ModuleNotFoundError: No module named 'hand_coded_scanner'.

So lets create our new file hand_coded_scanner.py, we also need to add the Scanner class.

# hand_coded_scanner.py
class Scanner:
  pass
Enter fullscreen mode Exit fullscreen mode

We can run our test again and we will get another error about this line of self.scanner = Scanner(2) because we are passing 2 as an argument to the Scanner class, but in hand_coded_scanner we are not telling our class that we are going to pass an argument to it, and to fix this we will add the some lines of code to the scanner class check the code below please.

Note that 2 is the buffer size limit for our scanner.

# hand_coded_scanner.py
class Scanner:
   def __init__(self, buffer_limit, dfa=None) -> None:
       pass
Enter fullscreen mode Exit fullscreen mode

By running our tests file another error will be displayed, AttributeError: 'Scanner' object has no attribute 'buffer_limit', so lets add the different attributes that we are testing in test_init_scanner, check the code below please

# hand_coded_scanner.py
class Scanner:
   def __init__(self, buffer_limit, dfa=None) -> None:
      self.buffer_limit = buffer_limit
      self.current_state = None
      self.buffer = ''
      self.char = ''
      self.current_input = 0
      self.current_buffer_input = 0
      self.file_content = ''
Enter fullscreen mode Exit fullscreen mode

In the next step we will write tests for a function to read content from a file and put the data into the array or buffer, we added the code below to test_hand_coded_scanner.py.

FILE_MOCK = "azeaze,azzqbz,nnnnn,abc,accc," \
            "abababa,bca,cab"
..
..
..
   def test_add_to_buffer(self):
      self.scanner.file_content = str(FILE_MOCK)
      self.scanner.add_to_buffer()
      self.assertEqual(self.scanner.buffer, 'az')
      self.scanner.add_to_buffer()
      self.assertEqual(self.scanner.buffer, 'ea')
      self.scanner.add_to_buffer()
      self.scanner.add_to_buffer()
      self.scanner.add_to_buffer()
      self.assertEqual(self.scanner.buffer, 'zz')
Enter fullscreen mode Exit fullscreen mode

FILE_MOCK mock here is a data similar of what we are going to read from a normal file later when we start using our scanner class and simulate the DFA.

If we try to run this test file now we will get AttributeError: 'Scanner' object has no attribute 'add_to_buffer' because our Scanner class does not have a function named add_to_buffer.

Let's do this by adding the logic that we are going to add to get the needed data from the file, we should not forget that we are limited by buffer_limit, so the length of the data that we can take from the file is limited by the number that we pass when we call our Scanner class, for example scanner = Scanner(39), in this case we can only take a limited amount of data, which is 39 in length.

After adding the function below to the Scanner class, we can execute the tests and everything will run smoothly.

def add_to_buffer(self) -> None:
      last_input = self.current_input
      self.current_input += self.buffer_limit
      self.buffer = self.file_content[last_input:self.current_input]
Enter fullscreen mode Exit fullscreen mode

Next, we can add tests for the function that increase our current character index, we will need to test if our function is really incrementing the current character index, and if we are getting the character we want.

self.scanner.buffer = 'ab'

char = self.scanner.next_char()
self.assertEqual(char, 'a')
self.assertEqual(self.scanner.current_buffer_input, 1)
Enter fullscreen mode Exit fullscreen mode

Then we will need to test this for a second time to make sure that everything work fine.

char = self.scanner.next_char()
self.assertEqual(char, 'b')
self.assertEqual(self.scanner.current_buffer_input, 0)

Enter fullscreen mode Exit fullscreen mode

As you can see the current character self.scanner.current_buffer_input is incrementing and we are getting 'b', this is exactly what we want.

Now, if we run the tests we will get an error, so let's go back and add, next_char to the Scanner class, check the code below please.

def next_char(self) -> str:
      char = self.buffer[self.current_buffer_input]
      self.current_buffer_input = (self.current_buffer_input + 1) % self.buffer_limit
      return char
Enter fullscreen mode Exit fullscreen mode

Our tests will pass successfully now.

Lets use this scanner class now to simulate our DFA, we will first need to create new class called Runner, in the class we will add an attribute named DFA, this is the object that represent our DFA in a form of dictionary, where the State is the keys, the next state is an a sub dict called 'next' and the character that make us go to this states is in a form of an array, we will also create a new scanner and 10 as a limited length for our buffer.

class Runner:
   def __init__(self, dfa=None) -> None:
      self.dfa = dfa if dfa else {
         'S0': {'next':
            {
               'S1':['a'],
               'S2':['b'],
               'S3':['c'],
            }
         },
         'S1': {'next':
            {
               'S4':['b'],
            }
         },
         'S2': {'next':
            {
               'S5':['c'],
            }
         },
         'S3': {'next':
            {
               'S6':['a'],
            }
         },
         'S4': {'next':
            {
               'S7':['c'],
            }
         },
         'S5': {'next':
            {
               'S8':['a'],
            }
         },
         'S6': {'next':
            {
               'S9':['b'],
            }
         },
         'S7': {'next':None},
         'S8': {'next':None},
         'S9': {'next':None},
      }

      self.scanner = Scanner(10)
Enter fullscreen mode Exit fullscreen mode

Now we will add a function to go from a state to another,i choose to name it next state.

 def next_state(self, char) -> None:
      current_state = self.scanner.current_state
      state = self.dfa.get(current_state, None)
      next_state = state.get('next', None) if state else None
      self.scanner.current_state = None
      if next_state:
         for key, value in next_state.items():
            if char in value:
               self.scanner.current_state = key
               break

Enter fullscreen mode Exit fullscreen mode

Now it is time to add our main function to simulate the DFA execute, you should know that i implemented the logic to read just a one time from the file.

def execute(self) -> None:
      self.scanner.file_content = self.open_file()
      self.scanner.add_to_buffer()
      self.scanner.current_state = 'S0'
      char = self.scanner.buffer[0]
      last_start_of_word = 0

      for index in range(1,10):
         char = self.scanner.next_char()


         if char == ',':
            word = self.scanner.buffer[last_start_of_word : self.scanner.current_buffer_input - 1]
            last_start_of_word = self.scanner.current_buffer_input
            self.scanner.current_state = 'S0'
            self.scanner.next_char()
            continue

         self.next_state(char)

         if self.scanner.current_state == None:
            word = self.scanner.buffer[last_start_of_word : self.scanner.current_buffer_input]
            print(word)
            break

Enter fullscreen mode Exit fullscreen mode

As you can see there is this condition if char == ',': which is added to test if there is ',' , this character mark the end of the current word and a start of another word, we should also ignore this character which is not a part of our DFA.

This condition if self.scanner.current_state == None: tell us that we are at the end of the DFA, and our word is accepted so we simply need to print it.

In the end i want to tell you that i only implemented some basic functionality of the hand coded scanner, for example we can only read from the file once, a scanner will need to read from a file so many times and fill the buffer, and because of this the RollBack function is needed.

Below you can see how our two files will look like in the end.

# main file
class Scanner:
   def __init__(self, buffer_limit, dfa=None) -> None:
      self.buffer_limit = buffer_limit
      self.current_state = None
      self.buffer = ''
      self.char = ''
      self.current_input = 0
      self.current_buffer_input = 0
      self.file_content = ''

   def add_to_buffer(self) -> None:
      last_input = self.current_input
      self.current_input += self.buffer_limit
      self.buffer = self.file_content[last_input:self.current_input]

   def next_char(self) -> str:
      char = self.buffer[self.current_buffer_input]
      self.current_buffer_input = (self.current_buffer_input + 1) % self.buffer_limit
      return char


class Runner:
   def __init__(self, dfa=None) -> None:
      self.dfa = dfa if dfa else {
         'S0': {'next':
            {
               'S1':['a'],
               'S2':['b'],
               'S3':['c'],
            }
         },
         'S1': {'next':
            {
               'S4':['b'],
            }
         },
         'S2': {'next':
            {
               'S5':['c'],
            }
         },
         'S3': {'next':
            {
               'S6':['a'],
            }
         },
         'S4': {'next':
            {
               'S7':['c'],
            }
         },
         'S5': {'next':
            {
               'S8':['a'],
            }
         },
         'S6': {'next':
            {
               'S9':['b'],
            }
         },
         'S7': {'next':None},
         'S8': {'next':None},
         'S9': {'next':None},
      }

      self.scanner = Scanner(10)

   def open_file(self) -> str:
      return open("content.txt", "r").read()

   def next_state(self, char) -> None:
      current_state = self.scanner.current_state
      state = self.dfa.get(current_state, None)
      next_state = state.get('next', None) if state else None
      self.scanner.current_state = None
      if next_state:
         for key, value in next_state.items():
            if char in value:
               self.scanner.current_state = key
               break

   def execute(self) -> None:
      self.scanner.file_content = self.open_file()
      self.scanner.add_to_buffer()
      self.scanner.current_state = 'S0'
      char = self.scanner.buffer[0]
      last_start_of_word = 0

      for index in range(1,10):
         char = self.scanner.next_char()


         if char == ',':
            word = self.scanner.buffer[last_start_of_word : self.scanner.current_buffer_input - 1]
            last_start_of_word = self.scanner.current_buffer_input
            self.scanner.current_state = 'S0'
            self.scanner.next_char()
            continue

         self.next_state(char)

         if self.scanner.current_state == None:
            word = self.scanner.buffer[last_start_of_word : self.scanner.current_buffer_input]
            break




if __name__ == '__main__':
   Runner().execute()

Enter fullscreen mode Exit fullscreen mode
#########################################################################
# test a hand coded a scanner for the following RES "abc|bca|cab"       #
#########################################################################

import unittest
from hand_coded_scanner import Scanner

FILE_MOCK = "azeaze,azzqbz,nnnnn,abc,accc," \
            "abababa,bca,cab"

class HandCodedScannerTest(unittest.TestCase):
   def setUp(self):
      self.scanner = Scanner(2)

   def test_init_scanner(self):
      self.assertEqual(self.scanner.buffer_limit, 2)
      self.assertEqual(self.scanner.current_state, None)
      self.assertEqual(self.scanner.buffer, '')
      self.assertEqual(self.scanner.char, '')
      self.assertEqual(self.scanner.current_input, 0)
      self.assertEqual(self.scanner.current_buffer_input, 0)
      self.assertEqual(self.scanner.file_content, '')

   def test_add_to_buffer(self):
      self.scanner.file_content = str(FILE_MOCK)
      self.scanner.add_to_bufferz()
      self.assertEqual(self.scanner.buffer, 'az')
      self.scanner.add_to_buffer()
      self.assertEqual(self.scanner.buffer, 'ea')
      self.scanner.add_to_buffer()
      self.scanner.add_to_buffer()
      self.scanner.add_to_buffer()
      self.assertEqual(self.scanner.buffer, 'zz')

   def test_next_char(self):
      self.scanner.buffer = 'ab'

      char = self.scanner.next_char()
      self.assertEqual(char, 'a')
      self.assertEqual(self.scanner.current_buffer_input, 1)


      char = self.scanner.next_char()
      self.assertEqual(char, 'b')
      self.assertEqual(self.scanner.current_buffer_input, 0)



if __name__ == '__main__':
   unittest.main()
Enter fullscreen mode Exit fullscreen mode

Discussion (0)