Suttipong Kullawattana

How to extract sentences with spaCy

In NLP work, I sometimes need to extract words from the dependency information (nsubj, VERB, dobj, etc.) of both simple and complex sentences.

For example:

Simple sentence
- Subject <- VERB -> dobj

A complex sentence has many layers of dependency
# multiple subjects
- (nsubj -> nsubj) <- VERB -> dobj
# multiple verbs
- (nsubj -> nsubj) <- VERB -> dobj, VERB -> VERB
# multiple objects
- (nsubj -> nsubj) <- VERB -> dobj, VERB -> VERB -> dobj
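To see these dependency labels for yourself, you can print the label and head of every token. A small check, assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The system sends a notification.")

#print each token with its dependency label and the head it attaches to
for token in doc:
    print(token.text, token.dep_, token.head.text)
#"system" is the nsubj of "sends" and "notification" is its dobj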

In this post, I introduce a method for extracting sentence structure with Python and spaCy, analyzing the grammar by creating step-by-step rules for extracting words.

Step 1: Import spaCy
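If the English model is not installed yet, download it once and then load it:

import spacy

#download once from the command line if needed:
#  python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')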

Step 2: Create a Phrases() class for extracting sentences. You can follow the code below.

import spacy

class Phrases():
   def __init__(self, sentence):
       self.nlp = spacy.load('en_core_web_sm')
       self.sentence = str(sentence)
       self.doc = self.nlp(self.sentence)
       self.sequence = 0
       self.svos = []

Step 3: Create a method for merging the noun phrases of a sentence. You can follow the code below.

def merge_phrases(self):
    with self.doc.retokenize() as retokenizer:
        for np in list(self.doc.noun_chunks):
                attrs = {
                    "tag": np.root.tag_,
                    "lemma": np.root.lemma_,
                    "ent_type": np.root.ent_type_,
                }
                retokenizer.merge(np, attrs=attrs)
    return self.doc
  • We open the doc's retokenize() context manager as retokenizer.
  • We loop through all the noun chunks in the document.
  • We merge each noun chunk into a single token, keeping the tag, lemma, and entity type of the chunk's root.
  • We return the doc object.
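To see what the retokenizer does, here is a small check, assuming the Phrases class from Step 2 with the merge_phrases() method added:

phrases = Phrases("The receipt department returns the goods to the vendor.")
print([token.text for token in phrases.doc])   #individual tokens
phrases.merge_phrases()
print([token.text for token in phrases.doc])   #noun chunks such as "The receipt department",
                                               #"the goods" and "the vendor" are now single tokens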

Step 4: Create a method for merging punctuation into the preceding token. You can follow the code below.

def merge_punct(self):
        spans = []
        for word in self.doc[:-1]:
            if word.is_punct or not word.nbor(1).is_punct:
                continue
            start = word.i
            end = word.i + 1
            while end < len(self.doc) and self.doc[end].is_punct:
                end += 1
            span = self.doc[start:end]
            spans.append((span, word.tag_, word.lemma_, word.ent_type_))
        with self.doc.retokenize() as retokenizer:
            for span, tag, lemma, ent_type in spans:
                attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
                retokenizer.merge(span, attrs=attrs)
        return self.doc 
  • If the word is a punctuation mark, or if the next word is not a punctuation mark, then skip it.
  • Otherwise, start with the current word, and keep adding words to the span until you reach the end of the document, or until you reach a word that is not a punctuation mark.
  • Then, merge the span into a single token and return the doc.
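A small check of the punctuation merge, assuming the Phrases class with the merge_punct() method added:

phrases = Phrases("Call the help desk, and make a request.")
phrases.merge_punct()
print([token.text for token in phrases.doc])
#trailing punctuation is folded into the previous token, e.g. "desk," and "request."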

Step 5: Create a method for extracting the grammar of a sentence. First, take the text from the sentence parameter, run it through the NLP pipeline, keep the result in the doc object, and then merge noun phrases and punctuation. You can follow the code below.

def get_svo(self, sentence):
    self.doc = self.nlp(sentence)
    doc = self.merge_phrases()
    doc = self.merge_punct()

From this method, I extend it to check whether the sentence is passive or active using the doc object, find all the main verbs and the children of each verb, find the subject of each main verb (including subjects joined by conjunctions), find the object of each main verb, find the prepositional modifier of the main verb, and finally extract S, V, and O and append each SVO to the result list.

In the first part, I check whether the sentence is passive or active by looking at the tokens of the doc.

Simple sentence:
# Subject <- VERB -> dobj

In a passive sentence, we find an "auxpass" dependency:
# Subject <- auxpass <- VERB -> dobj

def is_passive(self, tokens):
   for tok in tokens:
      if tok.dep_ == "auxpass":
        return True
   return False
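For example, assuming the Phrases class with the is_passive() method added, a passive sentence produces an auxpass token while an active one does not:

phrases = Phrases("The incident is assigned to the help desk.")
print(phrases.is_passive(phrases.doc))   #True, because "is" has dep_ == "auxpass"

phrases = Phrases("The help desk assigns the incident.")
print(phrases.is_passive(phrases.doc))   #False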

In the verb-finding methods, I get all the main verbs. You can follow the code below.

def _is_verb(self, token):
   return token.dep_ in ["ROOT", "xcomp", "appos", "advcl", "ccomp", "conj"] and token.tag_ in ["VB", "VBZ", "VBD", "VBN", "VBG", "VBP"]

def find_verbs(self, tokens):
   verbs = [tok for tok in tokens if self._is_verb(tok)]
   return verbs
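A quick check, assuming the find_verbs() method has been added to the Phrases class (the exact parse can vary between model versions):

phrases = Phrases("The receipt department returns the goods and the system sends a notification.")
print([verb.text for verb in phrases.find_verbs(phrases.doc)])
#expected something like ['returns', 'sends']: "returns" is the ROOT and "sends" is a conj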

For finding subjects, I get all the subjects of a verb. You can follow the code below.

def get_all_subs(self, v):
   #get all subjects
   subs = [tok for tok in v.lefts if tok.dep_ in ["ROOT", "nsubj", "nsubjpass"] and tok.tag_ in ["NN" , "NNS", "NNP"]]
   if len(subs) == 0:
     #get all subjects from the left of verb ("nsubj" <= "preconj" <= VERB)
     subs = [tok for tok in v.lefts if tok.dep_ in ["preconj"]]
     for sub in subs:
        rights = list(sub.rights)
        right_dependency = [tok.lower_ for tok in rights]
        if len(right_dependency) > 0:
           subs = right_dependency[0]
   return subs 
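A quick check of the subject extraction, assuming the methods above have been added to the Phrases class:

phrases = Phrases("The system sends a notification.")
verb = phrases.find_verbs(phrases.doc)[0]
print([sub.text for sub in phrases.get_all_subs(verb)])
#expected ['system'], the nsubj of "sends"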

For finding objects, I get all the objects of a verb. You can follow the code below.

def get_all_objs(self, v, is_pas):
   #get list the right of dependency with VERB (VERB => "dobj" or "pobj")
   rights = list(v.rights)
   objs = [tok for tok in rights if tok.dep_ in ["dobj", "dative", "attr", "oprd", "pobj"] or (is_pas and tok.dep_ == 'pobj')]
   #get all objects from the right of dependency (VERB => "dobj" or "pobj")
   for obj in objs:
      #on the right of dependency, you can get objects from prepositions (VERB => "dobj" => "prep" => "pobj")
      rights = list(obj.rights) 
      objs.extend(self._get_objs_from_prepositions(rights, is_pas))
   return v, objs
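A quick check of the object extraction, assuming the methods above have been added to the Phrases class (the exact parse can vary between model versions):

phrases = Phrases("I read a book about history.")
verb = phrases.find_verbs(phrases.doc)[0]
_, objs = phrases.get_all_objs(verb, is_pas=False)
print([obj.text for obj in objs])
#expected roughly ['book', 'history']: "book" is the dobj of "read"
#and "history" is reached through the preposition "about"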

**You can get objects from prepositions:**

def _get_objs_from_prepositions(self, deps, is_pas):
   objs = []
   for dep in deps:
      if dep.pos_ == "ADP" and (dep.dep_ == "prep" or (is_pas and dep.dep_ == "agent")):
         objs.extend([tok for tok in dep.rights if tok.dep_  in ["dobj", "dative", "attr", "oprd", "pobj"] or (tok.pos_ == "PRON" and tok.lower_ == "me") or (is_pas and tok.dep_ == 'pobj')])
   return objs

Finally, I put the SVO extraction steps together in the get_svo() method.

def get_svo(self, sentence):
    self.doc = self.nlp(sentence)
    doc = self.merge_phrases()
    doc = self.merge_punct()

    #check passive or active sentence
    is_pas = self.is_passive(doc)

    #find the main verbs and the children of each verb
    verbs = self.find_verbs(doc)

    #a sentence can have more than one verb
    for verb in verbs:
        self.sequence += 1

        #find the subject of the main verb
        subject = self.get_all_subs(verb)

        #find the object of the main verb
        verb, obj = self.get_all_objs(verb, is_pas)

        #find the prepositional modifier of the main verb
        to_pobj = self.main_get_to_pobj(verb)

        #append the SVO result
        if to_pobj is not None:
            self.svos.append((self.sequence, subject, verb, obj, to_pobj))
        else:
            self.svos.append((self.sequence, subject, verb, obj, ""))
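Note that main_get_to_pobj() is used above but not defined in the post. One possible implementation, a hypothetical sketch that is consistent with the results shown below, looks for a preposition attached directly to the verb and returns the text of its objects. You can add it to the Phrases class:

def main_get_to_pobj(self, v):
    #hypothetical helper: find a preposition attached to the verb (e.g. "to")
    #and return the text of its prepositional objects, if any
    for right in v.rights:
        if right.pos_ == "ADP" and right.dep_ in ["prep", "dative"]:
            pobjs = [tok.text for tok in right.rights if tok.dep_ == "pobj"]
            if pobjs:
                return pobjs
    return None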

Final Class

import spacy

class Phrases():
    def __init__(self, sentence):
       self.nlp = spacy.load('en_core_web_sm')
       self.sentence = str(sentence)
       self.doc = self.nlp(self.sentence)
       self.sequence = 0
       self.svos = []

    def merge_phrases(self):
        with self.doc.retokenize() as retokenizer:
            for np in list(self.doc.noun_chunks):
                    attrs = {
                        "tag": np.root.tag_,
                        "lemma": np.root.lemma_,
                        "ent_type": np.root.ent_type_,
                    }
                    retokenizer.merge(np, attrs=attrs)
        return self.doc

    def merge_punct(self):
        spans = []
        for word in self.doc[:-1]:
            if word.is_punct or not word.nbor(1).is_punct:
                continue
            start = word.i
            end = word.i + 1
            while end < len(self.doc) and self.doc[end].is_punct:
                end += 1
            span = self.doc[start:end]
            spans.append((span, word.tag_, word.lemma_, word.ent_type_))
        with self.doc.retokenize() as retokenizer:
            for span, tag, lemma, ent_type in spans:
                attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
                retokenizer.merge(span, attrs=attrs)
        return self.doc

    def is_passive(self, tokens):
        for tok in tokens:
            if tok.dep_ == "auxpass":
                return True
        return False

    def _is_verb(self, token):
        return token.dep_ in ["ROOT", "xcomp", "appos", "advcl", "ccomp", "conj"] and token.tag_ in ["VB", "VBZ", "VBD", "VBN", "VBG", "VBP"]

    def find_verbs(self, tokens):
        verbs = [tok for tok in tokens if self._is_verb(tok)]
        return verbs

    def get_all_subs(self, v):
        #get all subjects
        subs = [tok for tok in v.lefts if tok.dep_ in ["ROOT", "nsubj", "nsubjpass"] and tok.tag_ in ["NN" , "NNS", "NNP"]]
        if len(subs) == 0:
            #get all subjects from the left of verb ("nsubj" <= "preconj" <= VERB)
            subs = [tok for tok in v.lefts if tok.dep_ in ["preconj"]]
            for sub in subs:
                rights = list(sub.rights)
                right_dependency = [tok.lower_ for tok in rights]
                if len(right_dependency) > 0:
                    subs = right_dependency[0]
        return subs

    def get_all_objs(self, v, is_pas):
        #get list the right of dependency with VERB (VERB => "dobj" or "pobj")
        rights = list(v.rights)
        objs = [tok for tok in rights if tok.dep_ in ["dobj", "dative", "attr", "oprd", "pobj"] or (is_pas and tok.dep_ == 'pobj')]
        #get all objects from the right of dependency (VERB => "dobj" or "pobj")
        for obj in objs:
            #on the right of dependency, you can get objects from prepositions (VERB => "dobj" => "prep" => "pobj")
            rights = list(obj.rights) 
            objs.extend(self._get_objs_from_prepositions(rights, is_pas))
        return v, objs

    def _get_objs_from_prepositions(self, deps, is_pas):
        objs = []
        for dep in deps:
            if dep.pos_ == "ADP" and (dep.dep_ == "prep" or (is_pas and dep.dep_ == "agent")):
                objs.extend([tok for tok in dep.rights if tok.dep_  in ["dobj", "dative", "attr", "oprd", "pobj"] or (tok.pos_ == "PRON" and tok.lower_ == "me") or (is_pas and tok.dep_ == 'pobj')])
        return objs

    def get_svo(self, sentence):
        self.doc = self.nlp(sentence)
        doc = self.merge_phrases()
        doc = self.merge_punct()

        #check passive or active sentence
        is_pas = self.is_passive(doc)

        #find the main verbs and the children of each verb
        verbs = self.find_verbs(doc)

        #a sentence can have more than one verb
        for verb in verbs:
            self.sequence += 1

            #find the subject of the main verb
            subject = self.get_all_subs(verb)

            #find the object of the main verb
            verb, obj = self.get_all_objs(verb, is_pas)

            #find the prepositional modifier of the main verb
            to_pobj = self.main_get_to_pobj(verb)

            #You can continue creating methods to extract more words ...

            #append the SVO result
            if to_pobj is not None:
                self.svos.append((self.sequence, subject, verb, obj, to_pobj))
            else:
                self.svos.append((self.sequence, subject, verb, obj, ""))

Let's test the class with a few sentences.
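Assuming main_get_to_pobj() is defined, for example as in the sketch above, a test run might look like this:

sentence = "CSO, a workflow controller handle assigned incidents."
phrases = Phrases(sentence)
phrases.get_svo(sentence)
print(phrases.svos)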

First sentence: CSO, a workflow controller handle assigned incidents.


Extraction result: [(1, [a workflow controller], handle, [assigned incidents], '')]

Second sentence: Call the help desk and make a request.


Extraction result: [(1, [], Call, [the help desk], ''), (2, [], make, [a request], '')]

Third sentence: Receipt department returns the good to the vendor and system sends notification to the purchase department.


Extraction result: [(1, [Receipt department], returns, [the good], ['the vendor']), (2, [], sends, [notification, the purchase department], '')]
