Quora Insincere Questions Classification - Part 1/2
Side note: this is the first part of two; see the conclusion for a pointer to the next part.
Context
An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.
Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions – those founded upon false premises, or that intend to make a statement rather than look for helpful answers.
In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.
Here’s your chance to combat online trolls at scale. Help Quora uphold their policy of “Be Nice, Be Respectful” and continue to be a place for sharing and growing the world’s knowledge.
Goal
Predict whether a question asked on Quora is sincere or not
An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere:
- Has a non-neutral tone
- Is disparaging or inflammatory
- Isn’t grounded in reality
- Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers
Dataset
The training data includes the question that was asked, and whether it was identified as insincere (target = 1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.
Note that the distribution of questions in the dataset should not be taken to be representative of the distribution of questions asked on Quora. This is, in part, because of the combination of sampling procedures and sanitization measures that have been applied to the final dataset.
Exploratory Data Analysis
Library imports
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from string import punctuation
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score, classification_report
from xgboost import XGBClassifier
import lightgbm as lgb
File descriptions
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - A sample submission in the correct format
- embeddings/ - (see below)
df = pd.read_csv("../input/quora-insincere-questions-classification/train.csv")
df.head()
| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |
Data fields
- qid - unique question identifier
- question_text - Quora question text
- target - a question labeled “insincere” has a value of 1, otherwise 0
pd.read_csv("../input/quora-insincere-questions-classification/test.csv").head()
| | qid | question_text |
|---|---|---|
| 0 | 0000163e3ea7c7a74cd7 | Why do so many women become so rude and arroga... |
| 1 | 00002bd4fb5d505b9161 | When should I apply for RV college of engineer... |
| 2 | 00007756b4a147d2b0b3 | What is it really like to be a nurse practitio... |
| 3 | 000086e4b7e1c7146103 | Who are entrepreneurs? |
| 4 | 0000c4c3fbe8785a3090 | Is education really making good people nowadays? |
Basic info and analysis of the target
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1306122 entries, 0 to 1306121
Data columns (total 3 columns):
qid 1306122 non-null object
question_text 1306122 non-null object
target 1306122 non-null int64
dtypes: int64(1), object(2)
memory usage: 29.9+ MB
No NaN values and no duplicated rows:
df.duplicated().sum()
0
Target analysis
df.target.value_counts()
0 1225312
1 80810
Name: target, dtype: int64
df.target.describe()
count 1.306122e+06
mean 6.187018e-02
std 2.409197e-01
min 0.000000e+00
25% 0.000000e+00
50% 0.000000e+00
75% 0.000000e+00
max 1.000000e+00
Name: target, dtype: float64
plt.figure(figsize=(5, 4))
sns.countplot(x='target', data=df)
plt.title('Distribution of questions by sincerity (insincere = 1)');
print(f'There are {df.target.sum() / df.shape[0] * 100 :.1f}% of insincere questions, which makes the dataset highly unbalanced.')
There are 6.2% of insincere questions, which makes the dataset highly unbalanced.
Word clouds
Data scientists generally don’t think much of word clouds, in large part because the placement of the words doesn’t mean anything other than “here’s some space where I was able to fit a word.” Still, clouds can come in handy for a first insight into the most common words…
Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud.
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)
print('Word cloud image generated from sincere questions')
# join the questions into a single string: str() on a Series would only keep its truncated repr
sincere_wordcloud = WordCloud(width=600, height=400, background_color='white', stopwords=stopwords, min_font_size=10).generate(" ".join(df[df["target"] == 0]["question_text"]))
# Sincere word cloud
plt.figure(figsize=(15, 6), facecolor=None)
plt.imshow(sincere_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show();
Word cloud image generated from sincere questions
print('Word cloud image generated from INsincere questions')
insincere_wordcloud = WordCloud(width=600, height=400, background_color='white', stopwords=stopwords, min_font_size=10).generate(" ".join(df[df["target"] == 1]["question_text"]))
# Insincere word cloud
plt.figure(figsize=(15, 6), facecolor=None)
plt.imshow(insincere_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show();
Word cloud image generated from INsincere questions
Statistics from the question texts
The process of converting data to something a computer can understand is referred to as pre-processing. One of the major forms of pre-processing is to filter out useless data. In natural language processing, these useless words are referred to as stop words.
# if needed
# nltk.download('stopwords')
Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
We would not want these words taking up space in our database or valuable processing time, so we remove them easily by storing a list of words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python has stopword lists stored for 16 different languages. You can find them in the nltk_data directory.
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stop_words
{'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and',
 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being',
 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did',
 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down',
 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has',
 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers',
 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn',
 "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn',
 "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't",
 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other',
 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't",
 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'such',
 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then',
 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until',
 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 'weren', "weren't", 'what',
 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won',
 "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", "you're", "you've",
 'your', 'yours', 'yourself', 'yourselves'}
def create_features(df_):
    """Add count features from the text column: number of words, unique words, characters,
    stopwords, punctuation marks, upper/lower case words and title-case words."""
    df_["nb_words"] = df_["question_text"].apply(lambda x: len(str(x).split()))
    df_["nb_unique_words"] = df_["question_text"].apply(lambda x: len(set(str(x).split())))
    df_["nb_chars"] = df_["question_text"].apply(lambda x: len(str(x)))
    df_["nb_stopwords"] = df_["question_text"].apply(lambda x: len([w for w in str(x).split() if w.lower() in stop_words]))
    df_["nb_punctuation"] = df_["question_text"].apply(lambda x: len([c for c in str(x) if c in punctuation]))
    df_["nb_uppercase"] = df_["question_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
    df_["nb_lowercase"] = df_["question_text"].apply(lambda x: len([w for w in str(x).split() if w.islower()]))
    df_["nb_title"] = df_["question_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
    return df_
df = create_features(df)
df.sample(2)
| | qid | question_text | target | nb_words | nb_unique_words | nb_chars | nb_stopwords | nb_punctuation | nb_uppercase | nb_lowercase | nb_title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 685172 | 86319861df6a171eced7 | What are some of the least economically develo... | 0 | 9 | 9 | 60 | 5 | 1 | 0 | 8 | 1 |
| 302872 | 3b50058a0796f933924b | Who invented ASMR? | 0 | 3 | 3 | 18 | 1 | 1 | 1 | 1 | 1 |
Let’s take a sample, because the dataset is quite large to process locally on a single node, and visualize pair plots:
num_feat = ['nb_words', 'nb_unique_words', 'nb_chars', 'nb_stopwords', \
'nb_punctuation', 'nb_uppercase', 'nb_lowercase', 'nb_title', 'target']
# side note : remove target if needed later
df_sample = df[num_feat].sample(n=round(df.shape[0]/6), random_state=42)
# sns.pairplot creates its own figure, so no prior plt.figure() call is needed
sns.pairplot(data=df_sample, hue='target')
plt.show()
Basic stats comparison:
df_sample[df_sample['target'] == 0].describe()
| | nb_words | nb_unique_words | nb_chars | nb_stopwords | nb_punctuation | nb_uppercase | nb_lowercase | nb_title | target |
|---|---|---|---|---|---|---|---|---|---|
| count | 204532.000000 | 204532.000000 | 204532.000000 | 204532.000000 | 204532.000000 | 204532.000000 | 204532.000000 | 204532.000000 | 204532.0 |
| mean | 12.509334 | 11.880190 | 68.885475 | 6.043426 | 1.706897 | 0.459860 | 10.035466 | 2.062655 | 0.0 |
| std | 6.751813 | 5.781951 | 36.732624 | 3.620446 | 1.545802 | 0.842419 | 6.172322 | 1.431942 | 0.0 |
| min | 2.000000 | 2.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 25% | 8.000000 | 8.000000 | 44.000000 | 4.000000 | 1.000000 | 0.000000 | 6.000000 | 1.000000 | 0.0 |
| 50% | 11.000000 | 10.000000 | 59.000000 | 5.000000 | 1.000000 | 0.000000 | 8.000000 | 2.000000 | 0.0 |
| 75% | 15.000000 | 14.000000 | 83.000000 | 7.000000 | 2.000000 | 1.000000 | 12.000000 | 3.000000 | 0.0 |
| max | 56.000000 | 48.000000 | 319.000000 | 36.000000 | 65.000000 | 14.000000 | 51.000000 | 21.000000 | 0.0 |
df_sample[df_sample['target'] == 1].describe()
| | nb_words | nb_unique_words | nb_chars | nb_stopwords | nb_punctuation | nb_uppercase | nb_lowercase | nb_title | target |
|---|---|---|---|---|---|---|---|---|---|
| count | 13155.000000 | 13155.000000 | 13155.000000 | 13155.000000 | 13155.000000 | 13155.000000 | 13155.000000 | 13155.000000 | 13155.0 |
| mean | 17.338502 | 16.073280 | 98.295325 | 8.063398 | 2.382516 | 0.339187 | 13.960775 | 2.973014 | 1.0 |
| std | 9.697397 | 8.224487 | 55.799960 | 5.056734 | 3.991050 | 1.029390 | 8.770353 | 1.992830 | 0.0 |
| min | 1.000000 | 1.000000 | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| 25% | 10.000000 | 10.000000 | 54.000000 | 4.000000 | 1.000000 | 0.000000 | 7.000000 | 2.000000 | 1.0 |
| 50% | 15.000000 | 14.000000 | 86.000000 | 7.000000 | 2.000000 | 0.000000 | 12.000000 | 3.000000 | 1.0 |
| 75% | 23.000000 | 21.000000 | 130.000000 | 11.000000 | 3.000000 | 0.000000 | 19.000000 | 4.000000 | 1.0 |
| max | 60.000000 | 47.000000 | 878.000000 | 37.000000 | 372.000000 | 37.000000 | 53.000000 | 30.000000 | 1.0 |
Generally speaking, insincere questions are written with more words.
Now let’s focus on the distributions, because there is a difference in the spikes between sincere and insincere questions.
plt.figure(figsize=(10, 10))
for i, c in enumerate(num_feat):
    plt.subplot(3, 3, i + 1)
    sns.kdeplot(df_sample[df_sample['target'] == 0][c], shade=True)
    sns.kdeplot(df_sample[df_sample['target'] == 1][c], shade=False)
    plt.title(c)
plt.show()
Same conclusion here as shown by the stats above.
Obviously, many of these indicators are highly correlated with each other, but not with the target:
sns.set(style="white")
# Compute the correlation matrix
corr = df_sample[num_feat].corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(7, 6))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .5})
What are the most frequent words for each type of question?
class Vocabulary(object):
    # credits: Shankar G, see https://www.kaggle.com/kaosmonkey/visualize-sincere-vs-insincere-words
    def __init__(self):
        self.vocab = {}
        self.STOPWORDS = set(stopwords.words('english'))

    def build_vocab(self, lines):
        """Count the occurrences of each non-stopword (lowercased) across all lines."""
        for line in lines:
            for word in line.split(' '):
                word = word.lower()
                if word in self.STOPWORDS:
                    continue
                if word not in self.vocab:
                    self.vocab[word] = 0
                self.vocab[word] += 1

def generate_ngrams(text, n_gram=1):
    """Return the n-grams of a text as strings, skipping empty tokens and stop words."""
    # STOPWORDS comes from the wordcloud import above
    tokens = [token for token in text.lower().split(" ")
              if token != "" and token not in STOPWORDS]
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])
    return [" ".join(ngram) for ngram in ngrams]

import plotly.graph_objs as go  # needed by horizontal_bar_chart

def horizontal_bar_chart(df, color):
    """Build a horizontal plotly bar trace from a dataframe with 'word' and 'wordcount' columns."""
    trace = go.Bar(
        y=df["word"].values[::-1],
        x=df["wordcount"].values[::-1],
        showlegend=False,
        orientation='h',
        marker=dict(color=color),
    )
    return trace
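These two helpers are kept for reference but aren’t used below (the bar charts are drawn with seaborn instead). A quick illustration of generate_ngrams on a made-up sentence:
print(generate_ngrams("how to learn python fast", n_gram=2))
# ['learn python', 'python fast']  ('how' and 'to' are dropped as stop words)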
sincere_vocab = Vocabulary()
sincere_vocab.build_vocab(df[df['target'] == 0]['question_text'])
sincere_vocabulary = sorted(sincere_vocab.vocab.items(), reverse=True, key=lambda kv: kv[1])
df_sincere_vocab = pd.DataFrame(sincere_vocabulary, columns=['word_sincere', 'frequency'])
sns.barplot(y='word_sincere', x='frequency', data=df_sincere_vocab[:20])
insincere_vocab = Vocabulary()
insincere_vocab.build_vocab(df[df['target'] == 1]['question_text'])
insincere_vocabulary = sorted(insincere_vocab.vocab.items(), reverse=True, key=lambda kv: kv[1])
df_insincere_vocab = pd.DataFrame(insincere_vocabulary, columns=['word_insincere', 'frequency'])
sns.barplot(y='word_insincere', x='frequency', data=df_insincere_vocab[:20])
As we can clearly see, certain words (swear words, discriminatory words based on race, political figures, etc.) show up a lot in insincere questions.
Text processing & model training
Metric: F1-score
The most appropriate metric is the F1-score. Explanation from Wikipedia:
“In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.”
We’ll also use a per-class classification report; the most important part is the recall for insincere questions (class 1), because if that recall isn’t good enough, the F1-score won’t be satisfying either. Here are the mathematical definitions:
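With TP, FP and FN the numbers of true positives, false positives and false negatives for the positive class (insincere = 1):
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 × precision × recall / (precision + recall)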
def get_fscore_matrix(fitted_clf, model_name):
    print(model_name, ' :')
    # get class predictions for the classification report
    y_pred = fitted_clf.predict(X_test)
    print(classification_report(y_test, y_pred), '\n')
    print(f'F1-score = {f1_score(y_test, y_pred):.2f}')
Text processing
# if needed the first time
# import nltk
# nltk.download('punkt')
Process:
- tokenization
- keeping only alphabetical tokens (this drops punctuation and numbers)
- removing stop words
- stemming or lemmatization
Tokenization: given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
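A minimal illustration with NLTK’s word_tokenize (imported above):
print(word_tokenize("Isn't this a sincere question?"))
# ['Is', "n't", 'this', 'a', 'sincere', 'question', '?']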
Stemming vs. lemmatization: the aim of both processes is the same, reducing the inflectional forms of each word to a common base or root.
Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always, and that is why this approach presents some limitations.
Lemmatization, on the other hand, takes into consideration the morphological analysis of the words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the form back to its lemma.
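A minimal sketch of the difference (lemmatization is not used below, we stick to the PorterStemmer; WordNetLemmatizer assumes nltk’s 'wordnet' data has been downloaded):
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download('wordnet')  # if needed the first time
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["studies", "studying", "batteries"]:
    print(word, '->', stemmer.stem(word), '/', lemmatizer.lemmatize(word))
# studies -> studi / study        (the stem is not a real word)
# studying -> studi / studying    (without a POS tag the lemmatizer defaults to nouns)
# batteries -> batteri / battery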
df = df[['question_text', 'target']]
def text_processing(local_df):
    """Return the dataframe with tokens stemmed, and numerical values & stopwords removed."""
    stemmer = PorterStemmer()
    # Perform preprocessing
    local_df['txt_processed'] = local_df['question_text'].apply(word_tokenize)
    local_df['txt_processed'] = local_df['txt_processed'].apply(lambda x: [item for item in x if item.isalpha()])
    local_df['txt_processed'] = local_df['txt_processed'].apply(lambda x: [item for item in x if item not in stop_words])
    local_df['txt_processed'] = local_df['txt_processed'].apply(lambda x: [stemmer.stem(item) for item in x])
    return local_df
df = text_processing(df)
df.tail(2)
| | question_text | target | txt_processed |
|---|---|---|---|
| 1306120 | How can one start a research project based on ... | 0 | [how, one, start, research, project, base, bio... |
| 1306121 | Who wins in a battle between a Wolverine and a... | 0 | [who, win, battl, wolverin, puma] |
First method: TF-IDF features
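As a reminder, with scikit-learn’s defaults (smooth_idf=True), the weight of a term t in a document d is tf-idf(t, d) = tf(t, d) × idf(t), with idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the total number of documents and df(t) the number of documents containing t; each document vector is then L2-normalized.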
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x, min_df=0.01, max_df=0.999)
# min_df & max_df param added for less memory usage
tf_idf = vectorizer.fit_transform(df['txt_processed']).toarray()
pd.DataFrame(tf_idf, columns=vectorizer.get_feature_names()).head()
| | Do | I | If | Is | are | becom | best | better | book | can | ... | where | whi | which | who | will | without | work | world | would | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 1 | 0.606073 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.554503 | 0.0 |
| 2 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | ... | 0.0 | 0.416315 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 3 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00000 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
| 4 | 0.000000 | 0.360439 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.56217 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 |
5 rows × 76 columns
# Split the data
X_train, X_test, y_train, y_test = train_test_split(tf_idf, df['target'], test_size=0.2, random_state=42)
XGBoost Classifier without weights
model = XGBClassifier(objective="binary:logistic")
model.fit(X_train, y_train)
get_fscore_matrix(model, 'XGB Clf withOUT weights')
XGB Clf withOUT weights :
precision recall f1-score support
0 0.94 1.00 0.97 245369
1 0.61 0.05 0.10 15856
micro avg 0.94 0.94 0.94 261225
macro avg 0.78 0.53 0.53 261225
weighted avg 0.92 0.94 0.92 261225
F1-score = 0.10
XGBoost Classifier with weights
# (number of negatives - number of positives) / number of positives, used as scale_pos_weight
ratio = ((len(y_train) - y_train.sum()) - y_train.sum()) / y_train.sum()
ratio
14.086722911598978
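Side note: the more common heuristic for scale_pos_weight is simply negatives / positives, i.e. the value above plus one:
(len(y_train) - y_train.sum()) / y_train.sum()   # = 15.086722911598978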
model = XGBClassifier(objective="binary:logistic", scale_pos_weight=ratio)
model.fit(X_train, y_train)
get_fscore_matrix(model, 'XGB Clf WITH weights')
XGB Clf WITH weights :
precision recall f1-score support
0 0.98 0.81 0.89 245369
1 0.19 0.68 0.30 15856
micro avg 0.81 0.81 0.81 261225
macro avg 0.58 0.75 0.59 261225
weighted avg 0.93 0.81 0.85 261225
F1-score = 0.30
Now LGBM with weights
model = lgb.LGBMClassifier(n_jobs = -1, class_weight={0:y_train.sum(), 1:len(y_train) - y_train.sum()})
model.fit(X_train, y_train)
get_fscore_matrix(model, 'LGBM weighted')
LGBM weighted :
precision recall f1-score support
0 0.98 0.73 0.84 245369
1 0.16 0.80 0.26 15856
micro avg 0.73 0.73 0.73 261225
macro avg 0.57 0.76 0.55 261225
weighted avg 0.93 0.73 0.80 261225
F1-score = 0.26
LogisticRegression
model = LogisticRegression(class_weight={0:y_train.sum(), 1:len(y_train) - y_train.sum()}, C=0.5, max_iter=100, n_jobs=-1)
model.fit(X_train, y_train)
get_fscore_matrix(model, 'LogisticRegression')
LogisticRegression :
precision recall f1-score support
0 0.98 0.72 0.83 245369
1 0.15 0.79 0.26 15856
micro avg 0.73 0.73 0.73 261225
macro avg 0.57 0.75 0.54 261225
weighted avg 0.93 0.73 0.80 261225
F1-score = 0.26
Second approach: a CountVectorizer / Logistic Regression pipeline
From the scikit-learn documentation: CountVectorizer converts a collection of text documents to a matrix of token counts; this implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
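A minimal sketch of what this produces, on a hypothetical toy two-document corpus:
from sklearn.feature_extraction.text import CountVectorizer
toy_corpus = ["how to learn python", "learn python fast"]
cv = CountVectorizer(analyzer="word", ngram_range=(1, 2))
counts = cv.fit_transform(toy_corpus)   # scipy.sparse.csr_matrix
print(cv.get_feature_names())
# ['fast', 'how', 'how to', 'learn', 'learn python', 'python', 'python fast', 'to', 'to learn']
print(counts.toarray())
# [[0 1 1 1 1 1 0 1 1]
#  [1 0 0 1 1 1 1 0 0]]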
df['str_processed'] = df['txt_processed'].apply(lambda x: " ".join(x))
df.head(2)
| | question_text | target | txt_processed | str_processed |
|---|---|---|---|---|
| 0 | How did Quebec nationalists see their province... | 0 | [how, quebec, nationalist, see, provinc, nation] | how quebec nationalist see provinc nation |
| 1 | Do you have an adopted dog, how would you enco... | 0 | [Do, adopt, dog, would, encourag, peopl, adopt... | Do adopt dog would encourag peopl adopt shop |
pipeline = Pipeline([("cv", CountVectorizer(analyzer="word", ngram_range=(1,4), max_df=0.9)),
("clf", LogisticRegression(solver="saga", class_weight="balanced", C=0.45, max_iter=250, verbose=1, n_jobs=-1))])
X_train, X_test, y_train, y_test = train_test_split(df['str_processed'], df.target, test_size=0.2, stratify = df.target.values)
lr_model = pipeline.fit(X_train, y_train)
lr_model
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
max_iter reached after 431 seconds
/home/sunflowa/Anaconda/lib/python3.7/site-packages/sklearn/linear_model/sag.py:334: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
"the coef_ did not converge", ConvergenceWarning)
[Parallel(n_jobs=-1)]: Done 1 out of 1 | elapsed: 7.3min finished
Pipeline(memory=None,
steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=0.9, max_features=None, min_df=1,
ngram_range=(1, 4), preprocessor=None, stop_words=None,
strip_a...penalty='l2', random_state=None,
solver='saga', tol=0.0001, verbose=1, warm_start=False))])
get_fscore_matrix(lr_model, 'lr_pipe')
lr_pipe :
precision recall f1-score support
0 1.00 0.98 0.99 245063
1 0.75 0.94 0.83 16162
micro avg 0.98 0.98 0.98 261225
macro avg 0.87 0.96 0.91 261225
weighted avg 0.98 0.98 0.98 261225
F1-score = 0.83
Conclusion, submission and opening
- First we have used TF-IDF, but the least we can say : this is not really efficient, the recall for insincere question isn’t good at all, so this seems to not be the right way to go…
- Instead, using CountVectorizer with a Logistic Regression is more efficient.
Now let’s see what will be the submission score ?
pd.read_csv("../input/quora-insincere-questions-classification/sample_submission.csv").head(2)
| | qid | prediction |
|---|---|---|
| 0 | 0000163e3ea7c7a74cd7 | 0 |
| 1 | 00002bd4fb5d505b9161 | 0 |
df_test = pd.read_csv("../input/quora-insincere-questions-classification/test.csv", index_col='qid')
df_test.tail(2)
| qid | question_text |
|---|---|
| ffffb1f7f1a008620287 | What are the causes of refraction of light? |
| fffff85473f4699474b0 | Climate change is a worrying topic. How much t... |
df_test = text_processing(df_test)
df_test['str_processed'] = df_test['txt_processed'].apply(lambda x: " ".join(x))
df_test.head(2)
| qid | question_text | txt_processed | str_processed |
|---|---|---|---|
| 0000163e3ea7c7a74cd7 | Why do so many women become so rude and arroga... | [whi, mani, women, becom, rude, arrog, get, li... | whi mani women becom rude arrog get littl bit ... |
| 00002bd4fb5d505b9161 | When should I apply for RV college of engineer... | [when, I, appli, RV, colleg, engin, bm, colleg... | when I appli RV colleg engin bm colleg engin s... |
y_pred_final = lr_model.predict(df_test['str_processed'])
y_pred_final
array([1, 0, 0, ..., 0, 0, 0])
df_submission = pd.DataFrame({"qid":df_test.index, "prediction":y_pred_final})
df_submission.head()
| | qid | prediction |
|---|---|---|
| 0 | 0000163e3ea7c7a74cd7 | 1 |
| 1 | 00002bd4fb5d505b9161 | 0 |
| 2 | 00007756b4a147d2b0b3 | 0 |
| 3 | 000086e4b7e1c7146103 | 0 |
| 4 | 0000c4c3fbe8785a3090 | 0 |
df_submission.to_csv('submission.csv', index=False)
Submission score = 0.61580, not that bad!
CREDITS: all the people mentioned above, and especially amokrane & moneynass for their inspiring work! Thanks :)
-> IN THE 2nd PART I’LL USE WORD EMBEDDINGS!