Investigating the Effectiveness of Convolutional Neural Networks Combined with Word2Vec in Resume Classification

Abstract: Organizations seek experienced candidates to drive their growth, but their primary challenge lies in identifying the right applicants. Each year they receive a vast number of applications, which makes it difficult to sift through them and select the best candidates. Traditionally, selection has meant manually reviewing CVs or resumes, a slow and laborious task. Automated resume classification streamlines recruitment by sorting candidates into job categories such as IT, Finance, Marketing, and HR. This study evaluates a hybrid model that combines Convolutional Neural Networks (CNNs) with Word2Vec embeddings, tested on a dataset of 4,000 anonymized resumes. Compared with two baselines, Support Vector Machines (SVM) with TF-IDF (84% accuracy) and bag-of-words Naive Bayes (77% accuracy), the CNN-Word2Vec model achieved 91% accuracy, with precision, recall, and F1-scores averaging 90%. Statistical analysis confirmed that the difference is significant (p < 0.05). The model’s success stems from Word2Vec’s semantic embeddings and the CNN’s feature extraction, which together offer a scalable, efficient solution for HR automation. Future work could explore multilingual datasets and contextual embeddings such as BERT. The study evaluates how well the combination of CNN and Word2Vec identifies job categories within resumes. The research question is how CNN-Word2Vec models compare with conventional techniques in accuracy and stability. Previous research has investigated skill extraction (Roy et al., 2018) and resume parsing (Zaroor et al., 2020), but little literature addresses job category classification with this hybrid method. The research objectives are to test the effectiveness of CNN-Word2Vec, compare it against baselines, and assess its practical functionality. By addressing this gap, the research aims to advance AI-based recruitment methods that reduce human workload and make recruitment fairer.
The study examined 4,000 anonymized resumes gathered from a public job listing platform, with equal numbers (1,000 each) of IT, Finance, Marketing, and HR candidates; the category labels were verified by professional assessment, with a Cohen’s kappa of 0.85. All resumes were preprocessed by lowercasing, punctuation removal, stop-word elimination, and tokenization, which yielded an average of 200 words per resume. A Word2Vec CBOW model was trained for 20 epochs with a window size of 5 to produce 300-dimensional embeddings, and a 2D PCA plot confirmed that semantic relations were captured, for example by placing “java” near “python”. The convolutional neural network (CNN) consisted of an embedding layer of Word2Vec vectors, convolutional filters of three sizes (3, 4, and 5) with 128 filters each and ReLU activation, max-pooling, and a fully connected softmax classification layer. Among the models evaluated, CNN-Word2Vec performed best, with 91% accuracy and precision, recall, and F1-scores of 90%, 91%, and 90% respectively. These metrics surpassed those of SVM with TF-IDF (84% accuracy) and bag-of-words Naive Bayes (77% accuracy), a difference confirmed by ANOVA (F = 12.3, p < 0.05). The distinct vocabulary of the IT and Marketing sectors yielded 94% and 93% accuracy respectively, whereas overlap between Marketing and HR terminology caused misclassifications that lowered HR accuracy to 88%. The confusion matrix and learning curve showed accuracy improving steadily, reaching 91% by epoch 10. The accuracy of CNN-Word2Vec is comparable to the 88% reported by Zaroor et al. (2020), and the model processed 1,000 resumes within 60 seconds, which supports fast screening and may help reduce bias.
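The preprocessing steps and the layer sizes described above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the function names, and the sample sentence are hypothetical, and the shape arithmetic assumes 1D convolutions over the token sequence with stride 1, no padding, and global max-pooling, none of which the abstract states explicitly. The text is split into tokens before stop words are filtered, because the filter operates on individual tokens.

```python
import string

# Hypothetical, deliberately tiny stop-word list; the study does not say which list it used.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with", "for", "on"}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [t for t in tokens if t not in STOP_WORDS]


def conv_output_lengths(seq_len, kernel_sizes=(3, 4, 5)):
    """Feature-map length per kernel size (assumes stride 1 and no padding)."""
    return {k: seq_len - k + 1 for k in kernel_sizes}


def pooled_feature_size(kernel_sizes=(3, 4, 5), filters_per_size=128):
    """Global max-pooling keeps one value per filter, so the sizes add up."""
    return len(kernel_sizes) * filters_per_size


if __name__ == "__main__":
    sample = "Experienced in Java, Python and the design of REST APIs."
    print(preprocess(sample))
    # A 200-token resume, the average length reported in the study.
    print(conv_output_lengths(200))
    print(pooled_feature_size())
```

Under these assumptions, a 200-token resume produces feature maps of length 198, 197, and 196 for kernel sizes 3, 4, and 5, and the pooled outputs concatenate into a 384-dimensional vector that feeds the softmax layer over the four job categories.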
The main limitations of this approach are its small, English-only dataset and its lack of support for multi-label classification; contextual embeddings such as BERT are left for future work. The CNN-Word2Vec model significantly outperforms the baselines in resume classification (91% accuracy) by leveraging semantic embeddings and convolutional feature extraction. It offers a practical, efficient solution for HR automation. Future research should explore multilingual datasets, contextual embeddings (e.g., BERT), and multi-label classification to address overlapping roles, which would further enhance the model’s practical value.
Keywords: Convolutional Neural Network, Word2Vec, Natural Language Processing, Resume Classification

Alawode J. A., Sodeinde V. O.
