nasircsecu

Step-by-step web application firewall (WAF) development using the multinomial naive Bayes algorithm

A web application firewall (WAF) is a firewall that monitors, filters, and blocks web parameters as they travel to and from a website or web application. It typically protects web applications from attacks such as cross-site request forgery (CSRF), cross-site scripting (XSS), file inclusion, and SQL injection, among others. A WAF differs from a regular firewall in that it can filter the content of specific web applications, while a regular firewall serves as a safety gate between servers.

Steps for developing a web application firewall with supervised machine learning:

*Step-1: Prepare the dataset*

To prepare the dataset, load the training dataset into a pandas DataFrame and extract two columns: txt_label and txt_text. txt_label contains the attack type and txt_text contains the attack sample.

trainDF = load_cvs_dataset(input_dataset)
txt_label = trainDF[payload_label]
txt_text = trainDF[payload_col_name]

This code segment is found in train_model.py.

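import numpy as np
import pandas as pd


def load_cvs_dataset(dataset_path):
    # Set the random seed for reproducibility
    np.random.seed(500)
    # Load the data with pandas, skipping malformed rows
    # (on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0)
    Corpus = pd.read_csv(dataset_path, encoding='latin-1', on_bad_lines='skip')

    return Corpus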

This code segment is found in dataset_load.py.
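As a rough illustration of the two-column layout used in this step, the sketch below parses a tiny in-memory CSV with Python's standard csv module instead of pandas. The column names attack_type and payload, the sample rows, and the helper load_labels_and_texts are made up for this example; in the project the real column names come from payload_label and payload_col_name.

```python
import csv
import io

# Hypothetical sample data: column names and rows are assumptions for illustration.
SAMPLE_CSV = """attack_type,payload
sqli,"1' OR '1'='1"
xss,"<script>alert(1)</script>"
norm,"hello world"
"""


def load_labels_and_texts(csv_text, label_col, text_col):
    """Return two parallel lists: the class labels and the payload texts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    labels, texts = [], []
    for row in reader:
        labels.append(row[label_col])
        texts.append(row[text_col])
    return labels, texts


txt_label, txt_text = load_labels_and_texts(SAMPLE_CSV, "attack_type", "payload")
print(txt_label)
print(txt_text)
```

Keeping the labels and texts as parallel sequences means the label at position i always describes the sample at position i, which is what the later counting step relies on.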

*Step-2: Text feature engineering*

The next step is feature engineering. In this step, the raw text data is transformed into feature vectors, and new features are created from the existing dataset. We use count vectors as features to obtain relevant features from our dataset.

*Count vectors as features:*

A count vector is a matrix representation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell holds the frequency count of a particular term in a particular document.
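To make the row, column, and cell description concrete, here is a minimal sketch that builds a count matrix for a three-document toy corpus using only the standard library. The documents are invented for the example, and splitting on whitespace is a simplification of the tokenization used in the project.

```python
from collections import Counter

# Toy corpus: each string stands in for one document (assumed examples).
documents = [
    "select name from users",
    "select select from admin",
    "hello from users",
]

# Vocabulary: every distinct term in the corpus, in a fixed (sorted) column order.
vocabulary = sorted({term for doc in documents for term in doc.split()})

# One row per document, one column per vocabulary term, cell = term frequency.
count_matrix = []
for doc in documents:
    term_counts = Counter(doc.split())
    count_matrix.append([term_counts[term] for term in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

The second row has a 2 in the select column because that term occurs twice in the second document.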
Before generating the feature frequency matrix, clean the text of each document: replace digits with spaces, tokenize, and drop tokens tagged as nouns or adjectives.

doc = re.sub(r"\d+", " ", doc)
result_doc = word_tokenize(doc)
tagged_sentence = nltk.pos_tag(result_doc)
# Keep only tokens that are not tagged as nouns or adjectives
edited_sentence = [word for word, tag in tagged_sentence if tag not in ('NNP', 'NNPS', 'NNS', 'NN', 'JJ', 'JJR', 'JJS')]

This code segment is found in count_word_fit.py.
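The effect of the digit-stripping step on an attack-like string can be seen in the small sketch below. It does not use NLTK: the regular-expression tokenizer is a simplified stand-in for word_tokenize, there is no part-of-speech filtering, and the function name clean_and_tokenize is hypothetical.

```python
import re


def clean_and_tokenize(doc):
    """Replace digit runs with a space, then split into word and symbol tokens."""
    doc = re.sub(r"\d+", " ", doc)
    # Runs of letters/underscores become one token each; every other
    # non-space character becomes its own token.
    return re.findall(r"[A-Za-z_]+|[^\sA-Za-z_]", doc)


print(clean_and_tokenize("id=42 UNION SELECT 1,2"))
```

Removing the digits means that payloads differing only in numeric values produce the same tokens, which keeps the vocabulary small.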

After cleaning the text of each document, build the feature frequency counts. The code below counts how often each vocabulary term occurs in the documents of each class.

total_class_token = {}
class_eachtoken_count = {}

# Start every (class, vocabulary term) count at zero
for class_label in class_labels:
    total_class_token[class_label] = 0
    class_eachtoken_count[class_label] = {}
    for voc in vocabulary:
        class_eachtoken_count[class_label][voc] = 0

doccount = 0
total_voca_count = 0
for doc in doc_list:
    words = word_tokenize(doc)

    # The label at position doccount belongs to the current document
    class_label = temp_class_labels[doccount]

    for word in words:
        if word in vocabulary:
            # Count the term for its class, for the class total, and overall
            class_eachtoken_count[class_label][word] += 1
            total_class_token[class_label] += 1
            total_voca_count += 1

    doccount += 1



This code segment is found in count_word_fit.py.
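The same per-class bookkeeping can be sketched more compactly with collections.Counter. This is an illustrative variant, not the project's code: the training pairs are invented, documents are split on whitespace, and every token is counted rather than only those in a predefined vocabulary.

```python
from collections import Counter, defaultdict

# Assumed toy training data: (class label, already-cleaned document) pairs.
training_docs = [
    ("sqli", "select from users"),
    ("sqli", "union select"),
    ("xss", "script alert script"),
]

class_token_counts = defaultdict(Counter)  # label -> Counter of token -> count
total_class_tokens = Counter()             # label -> total number of tokens seen

for label, doc in training_docs:
    tokens = doc.split()
    class_token_counts[label].update(tokens)
    total_class_tokens[label] += len(tokens)

print(dict(total_class_tokens))
print(class_token_counts["sqli"]["select"])
```

A Counter returns 0 for a token it has never seen, so no explicit zero-initialisation loop is needed.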

*Step-3: Build the trained model*

The following code segment is the implementation of the multinomial naive Bayes algorithm:

import math


def multi_nativebayes_train(model_data):
    vocabulary = model_data.get_vocabulary()
    class_labels = model_data.get_class_labels()
    vocabularyCount = model_data.get_vocabularyCount()
    class_eachtoken_count = model_data.get_class_eachtoken_count()
    total_class_token = model_data.get_total_class_token()

    logprior = {}
    class_eachtoken_likelihood = {}
    for class_label in class_labels:
        # Log prior: this class's token count relative to vocabularyCount
        logprior[class_label] = math.log(total_class_token[class_label] / vocabularyCount)

        class_eachtoken_likelihood[class_label] = {}
        for word in vocabulary:
            count = class_eachtoken_count[class_label][word]
            if count == 0:
                # 0 is stored as a placeholder for a word never seen in this class
                class_eachtoken_likelihood[class_label][word] = 0
            else:
                # Log likelihood: share of the class's tokens that are this word
                class_eachtoken_likelihood[class_label][word] = math.log(count / total_class_token[class_label])

    # train_model (defined elsewhere in the project) holds the fitted parameters
    train_model_data = train_model(logprior, class_eachtoken_likelihood, vocabulary, class_labels)
    return train_model_data

This code segment is found in multinomial_nativebayes.py.
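For comparison, the textbook formulation of multinomial naive Bayes estimates the class prior from document counts and applies add-one (Laplace) smoothing, so a word unseen in a class still gets a small non-zero probability. The sketch below shows that variant under those assumptions; it is not the project's implementation, and the function name train_multinomial_nb and the sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(labeled_docs):
    """labeled_docs: list of (label, list_of_tokens) pairs.

    Returns (logprior, loglikelihood, vocabulary).
    """
    doc_counts = Counter()               # label -> number of documents
    token_counts = defaultdict(Counter)  # label -> token -> count
    vocabulary = set()
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)

    total_docs = sum(doc_counts.values())
    logprior, loglikelihood = {}, {}
    for label in doc_counts:
        logprior[label] = math.log(doc_counts[label] / total_docs)
        total_tokens = sum(token_counts[label].values())
        denom = total_tokens + len(vocabulary)  # add-one smoothing denominator
        loglikelihood[label] = {
            word: math.log((token_counts[label][word] + 1) / denom)
            for word in vocabulary
        }
    return logprior, loglikelihood, vocabulary


docs = [
    ("sqli", ["select", "from", "users"]),
    ("sqli", ["union", "select"]),
    ("xss", ["script", "alert"]),
]
logprior, loglikelihood, vocabulary = train_multinomial_nb(docs)
print(logprior["sqli"], loglikelihood["sqli"]["select"])
```

With smoothing, every vocabulary word has a genuine log-probability in every class, so no placeholder value is needed for unseen words.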

*Step-4: Predict on the test dataset*

After training, we get the trained model and save it on the web server. Now pass in a list of test data that contains both normal and abnormal samples, and get the list of prediction results from the trained model.

import re
from nltk.tokenize import word_tokenize


def multi_nativebayes_verna_predict(train_model_data, test_dataset):
    final_doc_class_label = {}
    logprior = train_model_data.get_logprior()
    class_eachtoken_likelihood = train_model_data.get_class_eachtoken_likelihood()
    vocabulary = train_model_data.get_vocabulary()

    doccount = 0
    for doc in test_dataset:
        doc = re.sub(r"\d+", " ", doc)
        words = [word.lower() for word in word_tokenize(doc)]
        max_score = 0
        final_class_label = ''
        is_norm = False

        for class_label in train_model_data.get_class_labels():
            # Sum the stored log-likelihoods of the known words; words outside
            # the vocabulary, or stored as 0 for this class, contribute nothing
            cond_log_prob = 0
            for word in words:
                if word in vocabulary:
                    cond_log_prob += class_eachtoken_likelihood[class_label][word]

            if cond_log_prob == 0:
                # Nothing contributed for this class: the document is labelled "norm"
                is_norm = True
                continue

            score_class = logprior[class_label] + cond_log_prob
            # As written, max_score starts at 0 and is replaced only by smaller
            # scores, so the class with the lowest score is kept
            if max_score > score_class:
                max_score = score_class
                final_class_label = class_label

        final_doc_class_label['doc-' + str(doccount)] = "norm" if is_norm else final_class_label
        doccount += 1

    return final_doc_class_label

This code segment is found in multinomial_nativebayes.py.
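For reference, the textbook naive Bayes decision rule picks the class with the highest log-posterior score. Because log-probabilities are negative, the running best score starts at negative infinity rather than 0. The sketch below illustrates that rule with hand-written parameters; the probabilities and the function name predict are made up for the example and do not come from the project.

```python
import math

# Hypothetical trained parameters (log-probabilities), hard-coded for illustration.
logprior = {"sqli": math.log(0.5), "xss": math.log(0.5)}
loglikelihood = {
    "sqli": {"select": math.log(0.6), "script": math.log(0.1), "alert": math.log(0.3)},
    "xss": {"select": math.log(0.1), "script": math.log(0.5), "alert": math.log(0.4)},
}


def predict(tokens):
    """Return the class with the highest log-posterior score."""
    best_label, best_score = None, -math.inf
    for label in logprior:
        # Tokens missing from the class table are skipped in this sketch.
        score = logprior[label] + sum(
            loglikelihood[label][t] for t in tokens if t in loglikelihood[label]
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


print(predict(["script", "alert"]))
print(predict(["select", "select"]))
```

Working in log space turns the product of many small probabilities into a sum, which avoids floating-point underflow on long inputs.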

In the final stage, calculate the accuracy of the algorithm in web parameter filtering:

def accuracy_score(testlabelcopy, final_doc_class_label):
    wrong_count = 0
    for label_count, label in enumerate(testlabelcopy):
        # Predictions are keyed as 'doc-0', 'doc-1', ... in test-set order
        if label != final_doc_class_label['doc-' + str(label_count)]:
            wrong_count += 1

    # Percentage of test samples whose predicted label matches the true label
    accuracy = ((len(testlabelcopy) - wrong_count) * 100) / len(testlabelcopy)

    return accuracy

This code segment is found in multinomial_nativebayes.py.
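As a small worked example of the accuracy calculation, the sketch below compares four true labels with four predictions held in plain lists. The function name accuracy_percent and the label values are assumptions for illustration; three of the four predictions match, so the result is 75 percent.

```python
def accuracy_percent(true_labels, predicted_labels):
    """Percentage of positions where the predicted label equals the true label."""
    if not true_labels:
        raise ValueError("true_labels must not be empty")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct * 100 / len(true_labels)


true_labels = ["sqli", "xss", "norm", "norm"]
predicted = ["sqli", "norm", "norm", "norm"]
print(accuracy_percent(true_labels, predicted))
```

The explicit check for an empty label list avoids a division by zero when no test data is supplied.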

*Step-5: Prediction on live input*

In production, the trained model is used for text classification to decide whether a web parameter is normal data or a malicious script.

def live_multi_nativebayes_verna_predict(train_model_data, input_doc):
    doc = re.sub(r"\d+", " ", input_doc)
    words = [word.lower() for word in word_tokenize(doc)]
    max_score = 0
    final_class_label = ''
    is_norm = False

    vocabulary = train_model_data.get_vocabulary()
    logprior = train_model_data.get_logprior()
    class_eachtoken_likelihood = train_model_data.get_class_eachtoken_likelihood()

    for class_label in train_model_data.get_class_labels():
        # Sum the stored log-likelihoods of the known words for this class
        cond_log_prob = 0
        for word in words:
            if word in vocabulary:
                cond_log_prob += class_eachtoken_likelihood[class_label][word]

        if cond_log_prob == 0:
            # Nothing contributed for this class: the input is labelled "norm"
            is_norm = True
            continue

        # Read the per-class prior into its own variable; logprior stays a dictionary
        logprior_val = logprior[class_label]
        score_class = logprior_val + cond_log_prob
        # As written, max_score starts at 0 and is replaced only by smaller
        # scores, so the class with the lowest score is kept
        if max_score > score_class:
            max_score = score_class
            final_class_label = class_label

    return "norm" if is_norm else final_class_label

This code segment is found in multinomial_nativebayes.py.
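One way such a live classifier could sit in front of an application is sketched below: each request parameter is classified, and anything not labelled "norm" is held back. The function filter_parameters, the parameter names, and the stand-in classifier toy_classify are all hypothetical; in this project the classifier argument would presumably wrap live_multi_nativebayes_verna_predict with the trained model.

```python
def filter_parameters(params, classify):
    """Split request parameters into allowed and blocked using a classifier.

    params: dict mapping parameter name -> value.
    classify: callable returning a label string; "norm" means harmless.
    """
    allowed, blocked = {}, {}
    for name, value in params.items():
        label = classify(value)
        if label == "norm":
            allowed[name] = value
        else:
            blocked[name] = label  # remember which attack class was detected
    return allowed, blocked


# Stand-in classifier for the demo only; a real deployment would call the trained model.
def toy_classify(value):
    return "xss" if "<script" in value.lower() else "norm"


allowed, blocked = filter_parameters(
    {"q": "shoes", "comment": "<script>alert(1)</script>"}, toy_classify
)
print(allowed)
print(blocked)
```

Passing the classifier in as a callable keeps the filtering logic independent of the model, so the model can be retrained or replaced without touching the request-handling code.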

Oldest comments (4)

nasircsecu

Regular-expression-based detection of vulnerable web parameters is used extensively in web application firewall (WAF) development, but machine learning is still not used in most WAF development, because the accuracy of machine-learning-based solutions has not yet reached the expected level. In my post I try to explain how to develop a WAF based on supervised machine learning. Read my post and start a discussion about my github.com/sapnilcsecu/Web-applica... solution. If you find any limitation in this solution, start a discussion about it.

Md. Ashifur Rahman

need more explanation ABOUT YOUR FANTASTIC WEB FIREWALL PROGRAM

nasircsecu

This is just the beginning. I plan to write another article in which I will explain in detail how machine learning can be used for web application vulnerability detection in a web application firewall (WAF).

Sharifullah • Edited

Can we extend this project by adding some other features, i.e. security, monitoring of incoming web traffic, and encrypted communication between client and server, as a CS final-year project to make a WAF? Your effort and response are appreciated.