Generation Theory in HR Practice: Text Mining for Talent Management Case (a study of generations using machine learning)

AUTHORS: Nikita Nikitinsky (NAUMEN), Polina Kachurina (ITMO University), Shashev Sergey (Tsentr Razrabotki), Evgeniya Shamis (Sherpa S Pro, RANEPA, IBS)


Automation of talent and skills management is becoming an increasingly popular strategic tool in business and government institutions.

In this paper, we describe the possible applications of Generation Theory for Human Resource Management in various business and government organizations, briefly introduce the Decision Support System for Talent Management (DSSTM) and present an attempt to classify persons from two different generations based on texts they produce. To cope with this task, we apply Text Mining techniques, namely, LSA-based key term extraction and Word2Vec model for word embeddings. The experiments show promising results.


Text Mining; Human Resource Management; Generation Theory; key-term extraction; Talent Management


A cross-disciplinary approach to management that combines social psychology and statistics may be in high demand right now. Technological advancement has brought us to a time when many organizational processes and functional tasks have become automated. Recent scientific modeling of social processes may help to optimize HRM, but the current theories of behavioral research are limited in terms of hypothesis testing.

In this article, we study the differences between two generations and try to create a text-based classifier to distinguish persons from different generations. Using inferential statistics, we aim to test our study hypothesis. According to Kerlinger (1973), there is a need for both operational and experimental terms to be conveyed. A statistical hypothesis expresses an aspect of the original theoretical hypothesis in quantitative and statistical terms. It is “a prediction of how the statistics used in analyzing the quantitative data of a research problem will turn out” (Kerlinger, 1973, p. 201). Our research currently combines a theoretical approach to statistical experimentation in behavioral studies with the goal of deriving practical algorithms for the automated categorization of social groups by values.

Nowadays HRM is popular as a strategic tool in business. Talent management receives special attention and encompasses activities in the field of HRM, involving employees in the innovation process, as well as the formation of creative incentives and the development of creativity potential in employees (Singh, 2010).

Data mining is an approach capable of providing in-depth data analysis and of detecting hidden data values (Sadath, 2013). The analysis of employees’ work assessment results, combined with the data collected during their appraisal periods and their resumes, reveals certain career growth patterns. Data mining gives HR departments the power to carefully analyze employees’ previous work experience and assess their future professional development potential (Al-Radaideh & Nagi, 2012), to thoroughly examine the results of employee testing, certification and training programs, and to plan future actions. Data is also an important tool for highlighting motivational aspects for employees (Jantan, Hamdan & Othman, 2011). The current trend in Data Mining for HRM is text analysis.

Talent management systems in the Russian market are only beginning to enter the larger business market. Yet Talent Management and Data Mining are already used by some eGovernment systems and government organizations, as eGovernment is often an area where innovative solutions are implemented. For example, the Directorate of Scientific and Technical Programmes of the Russian Federation utilizes advanced Text Mining approaches for its Xpir project and conducts research on Talent Management, such as automatically selecting the best experts for certain areas of expertise based on multiple factors, including texts composed by the experts (Nikitinsky, Shashev, Kachurina & Bespalov, 2016).


The key notion of Generation Theory (GT) is the concept of a “generation”. The theory was developed in 1991 by the US researchers and practitioners Neil Howe and William Strauss, who based their research on the specific traits of various generations (Strauss & Howe, 1991), and it has been adapted to the socio-economic conditions of other countries, such as Japan and South Africa. A generation is a group of people born in a certain period of time. The process of generational value formation is considered complete when individuals reach adolescence (12 years) and formulate their own individual world views (Reyan, 2016; Myers, 2012). Generational values are formed under the influence of four factors: social events, upbringing and schooling practices, media and visual environment (Strauss & Howe, 1991), as well as shortages and crisis periods (Shamis & Nikonov, 2016). These values are formed during childhood and remain constant throughout the lifetime. In the majority of cases, values are traced and monitored after a generation has reached the age of 18. Traditionally, one generation covers a period of about 20 years, with rare exceptions (Strauss & Howe, 1991). Members of a generation identify themselves as a separate group of people whose point of view differs from that of other generations. The approach was first implemented by Russian researchers in the RuGenerations project in 2002.

The following generations are defined for the 20th and 21st centuries in Russia (Shamis & Nikonov, 2016), the USA, and South Africa.
Table 1. Generation names

Name of generation (traditional) | Years of birth in Russia | Years of birth in USA
GI                               | 1900–1923                | 1901–1924
Silent                           | 1923–1943                | 1925–1942
Baby-Boomers                     | 1943–1963                | 1943–1960
X                                | 1964–1984                | 1961–1981
Millennium (Millennial)          | 1985–2003 (?)*           | 1982–2004
Z                                | 2004 (?)* – 2024 (?)*    | 2005 (?)* – (?)*

*The date is preliminary and will only be conclusively established after Generation Z reaches the age of 18. At this point it will be given a new title.

The GT’s interdisciplinary nature combines approaches used in economics, history, sociology, and psychology in order to identify Generation values which are shared by large groups of individuals. This allows researchers to analyze the behavioral similarities and differences, as well as establish interconnections among study subjects. Extensive amounts of data are challenging to measure due to the complications that may occur when attempting to extract separate elements. The potential application of this approach in HRM especially for government organizations is our main area of focus in this study.


Let us examine several examples of how generational values are used in HRM, beginning with a specific case. A large insurance company was faced with the task of developing several versions of a web site intended to attract different generations. The prototypes of these versions were developed by representatives of three generations: Baby-Boomers, Generation X, and Millennials. Each variant conflicted with the values of the other generations in its language and images. The main source of conflict was that each version reflected a reality typical of one generation and foreign to the others. One conflict arose from two different interpretations of a slogan built around the notion of a bicycle. To Millennials, the bicycle symbolized a healthy, active lifestyle, which is a priority for this generation. Their slogan reflected their generational values: “Move the pedals”. Russian Generation X had less appealing associations. Their slogan caused conflict due to its pessimistic nature: “Get away from here before you get beaten” (Russian idiom: “Крути педали, пока не дали”).

By changing its attraction and recruiting practices, a Russian production company managed to attract many more Millennials. Generation X, however, stopped seeing themselves as viable employee candidates for this company. As a result, the company had to develop double-generation practices in order to attract two generations simultaneously.

In order to prove the success of using Generation methods for HRM, we need to determine if there is a possibility to define generation-specific traits and to personalize the choice of instruments best used for each generation with the help of automation. If this possibility exists, companies will be able to develop personalized toolkits fitted for each employee.


In this section, we give a basic description of the DSSTM system, which is based on a text mining approach. The system enables automated competency assessment (CA) aimed at overcoming the limited and human-biased approach to employee CA.

The system was designed to cope with various talent management tasks. Currently, for CA the system combines the information from three large sources: employee HR profiles, all the text documents produced by employees (e.g. scientific publications, work reports), and the results of professional skill tests and other traditional CA methods.

Competence evaluation consists of two steps: an identification step, in which the system checks for the presence of certain competencies attributed to an employee, and an evaluation step, in which each competence is evaluated with the help of modifiers.

DSSTM uses a rule-based approach for competence identification. Based on possible features, obtained from worker profiles, text documents or professional skill test results, DSSTM users create rules in order to identify the presence of competence for an employee.

The general formula for CA may thus be expressed as:

where the quantities are the competence score, the basic score, the basic parameter modifier score (B), the HR modifier score (HR), and the text modifier score (TXT).

The basic score reflects the evidence of the identified competence. It is calculated as a ratio of the result of competence identification for a certain employee to the maximum result of competence identification in a department or organization as a whole. Each modifier score is computed with the following formula:

where the terms are the adjusted modifier element, scaled from 0 to 1 for computational convenience; imp, the weight of each element (equal to 1 by default); and gl, the global weight of the element (equal to 1/n by default).
The CA algorithm is customizable and allows adding more rules and parameters to the formulas.
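
As a rough illustration of the two scores described above, the sketch below computes the basic score as the stated ratio and a modifier score as a weighted sum of adjusted elements. The weighted-sum form is an assumption: it is chosen only because it is consistent with the stated defaults (imp = 1, gl = 1/n), under which it reduces to the mean of the elements. The function names are hypothetical and are not part of DSSTM.

```python
from typing import Optional, Sequence


def basic_score(employee_result: float, max_result: float) -> float:
    """Ratio of an employee's identification result to the maximum result
    observed in the department or organization."""
    if max_result <= 0:
        raise ValueError("max_result must be positive")
    return employee_result / max_result


def modifier_score(
    elements: Sequence[float],
    imp: Optional[Sequence[float]] = None,
    gl: Optional[Sequence[float]] = None,
) -> float:
    """ASSUMED form: sum of element * imp * gl over all modifier elements."""
    n = len(elements)
    if n == 0:
        return 0.0
    imp = [1.0] * n if imp is None else list(imp)    # default weight: 1
    gl = [1.0 / n] * n if gl is None else list(gl)   # default global weight: 1/n
    if len(imp) != n or len(gl) != n:
        raise ValueError("imp and gl must have one entry per element")
    if any(not 0.0 <= x <= 1.0 for x in elements):
        raise ValueError("adjusted elements must lie in [0, 1]")
    return sum(x * w * g for x, w, g in zip(elements, imp, gl))


if __name__ == "__main__":
    print(basic_score(3, 4))
    print(modifier_score([0.2, 0.4, 0.9]))
```

With the default weights the modifier score stays between 0 and 1, which keeps it on the same scale as the basic score.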

The DSSTM system is currently undergoing pilot testing. We have uncovered some limitations of the present DSSTM model, one of which is that the system can estimate only technical competences.

In this paper, we research the possibility of applying aspects of the GT to the system as a potential parameter in order to improve the quality of CA and estimation of professional interests.


6.1 Data

For this study, 600 people were randomly selected from the social network web-sites Facebook and VK. Only users whose profiles satisfied strict requirements were chosen. The requirements were as follows:

  • Users had to be born between 1968 and 1981 (for Generation X, excluding borderline years) or between 1989 and 1999 (for Generation Millennium, excluding borderline years);
  • Users had to be from one of the five largest cities in Russia (Moscow, Saint Petersburg, Novosibirsk, Samara, or Yekaterinburg);
  • Selected users were required to have more than five Russian language texts in social networks (wall posts, comments, etc.), containing no less than 200 words.
  • Users were required to have written the texts themselves (no copies or reposts).

We obtained two sets of texts, one for each generation (300 individuals per generation, evenly distributed across the five cities). We excluded borderline years in order to avoid relative differences and to focus on statistically significant cases. We then conducted several experiments in order to create a model able to detect the generation of a person based only on the texts produced by that person.

6.2 Pre-processing

For each text, we tokenized the contents. During the tokenization process we removed all non-word characters (e.g. punctuation marks) and stop-words. In our case, we treated interjections, pronouns, and words shorter than three characters as stop-words.
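
A minimal sketch of this tokenization step is given below. It is not the DSSTM implementation: the pronoun and interjection sets are tiny hypothetical stand-ins for what a real pipeline would obtain from part-of-speech tags, and the function name `tokenize` is ours.

```python
import re

# Hypothetical, deliberately tiny word lists; a real system would identify
# pronouns and interjections with a morphological analyzer instead.
PRONOUNS = {"она", "они", "его", "мой", "этот"}
INTERJECTIONS = {"ого", "увы", "ура"}

# One or more Unicode letters; digits, underscores and punctuation are dropped.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text, min_length=3):
    """Lowercase the text, keep letter-only tokens, and remove stop-words:
    pronouns, interjections, and words shorter than min_length characters."""
    tokens = WORD_RE.findall(text.lower())
    return [
        token
        for token in tokens
        if len(token) >= min_length
        and token not in PRONOUNS
        and token not in INTERJECTIONS
    ]


if __name__ == "__main__":
    print(tokenize("Ура! Она купила новый велосипед, и мы поехали в парк."))
```

Note that this simple pattern splits hyphenated words into their parts, which may or may not be desirable for a given corpus.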

Then, a morphological analysis was conducted. This process included lemmatization and part-of-speech tagging. Finally, key words were extracted from the texts of every individual, and a vector for each individual was created in the Word2Vec space.

Word2Vec is a neural network model used for word embedding analysis. The assumption behind the model is that words located in similar contexts tend to possess semantic closeness (i.e. similar meanings). The model supports two architectures: continuous bag-of-words (CBOW) and continuous skip-gram. In DSSTM, we utilized the skip-gram approach as it has been found to be more suitable for working with texts containing less frequently occurring words (Mikolov, Chen, Corrado & Dean, 2013).

Word2Vec was selected as the main tool for conducting the experiments because most new word embedding techniques rely on neural network architectures and this approach shows high-quality results in many practical studies. Just as importantly, the Word2Vec model has a number of well-functioning open-source implementations.

We used a pre-trained Word2Vec model, namely the Web corpus model from RusVectores (Kutuzov & Andreev, 2015). The model was trained on a collection of random Russian web pages crawled in December 2014, about 9 million documents in total (corpus size is 660 628 738 tokens). The model knows 353 608 different lemmas; lemmas occurring fewer than 30 times were ignored. It was trained using the Continuous Skip-Gram algorithm with a vector dimensionality of 500 and a window size of 2.

6.3 Key term extraction algorithm

In the DSSTM, we use a combined approach to key term extraction. Namely, we employ a combination of LSA and a rule-based approach.

LSA is a natural language processing technique that analyzes relationships between a set of documents and the terms they contain (Landauer, McNamara, Dennis & Kintsch, 2007). The assumption behind the algorithm is similar to that of Word2Vec: words occurring in the same contexts tend to share similar meanings. LSA constructs a weighted term-document matrix, where rows represent unique words and columns represent documents. LSA is based on the mathematical technique called SVD (singular value decomposition), which approximates the initial matrix with a lower-rank matrix. As a result, noise is significantly reduced, while the weight of the lemma is increased.
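
In standard notation (a textbook statement, not a description of the specific DSSTM configuration), the low-rank approximation behind LSA can be written as follows, where X is the m × n weighted term-document matrix and k is the chosen number of latent dimensions:

```latex
X \;\approx\; X_k = U_k \, \Sigma_k \, V_k^{\top},
\qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k},\quad
k < \operatorname{rank}(X)
```

Here the columns of U_k and V_k are the left and right singular vectors for the k largest singular values σ_1 ≥ … ≥ σ_k, so terms and documents are both represented in the same k-dimensional space.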

For every text document, two types of key terms were extracted: local and global. Local key terms consisted of words contained in the analyzed document, whereas global key terms included lemmas from the whole corpus of documents. Local key terms were intended to describe the document itself, while global key terms were intended to describe the subject area of the document.

For this study, we worked only with local key words, contained in texts of certain individuals in order to maintain the integrity of the study.
The algorithm for local key term extraction was the following:

  • Candidates for key terms were selected from the document with preliminary defined rules. The rules considered different parameters, including parts of speech and morphological information.
  • The similarities between each candidate-term vector and document vector in LSA space (cosine similarity) were estimated.
  • Then, top-n terms (n can be assigned by user, by default n = 30) with a cosine similarity above a certain threshold were selected.

As a result, human-interpretable key terms were obtained. For example, the top-10 key terms from the texts of a randomly selected person from Generation X were: размышление, заниженная самооценка, саморазрушение, самоуверенность, сознание, боль, мигрень, жизнь, лекарство, принимать таблетки (in English: reflection, low self-esteem, self-destruction, overconfidence, consciousness, pain, migraine, life, medicine, take pills).
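
The selection steps of the local key term algorithm can be sketched roughly as follows. The sketch assumes that candidate terms have already been chosen by the rules and mapped to vectors in the LSA space; the function names and the threshold value of 0.5 are hypothetical, since the source only mentions "a certain threshold".

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if one is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_key_terms(candidates, doc_vector, top_n=30, threshold=0.5):
    """Return up to top_n candidate terms whose similarity to the document
    vector exceeds the threshold, most similar first.

    candidates: dict mapping a term to its vector in the LSA space.
    threshold: hypothetical value; the source does not state it.
    """
    scored = [
        (term, cosine_similarity(vector, doc_vector))
        for term, vector in candidates.items()
    ]
    kept = [(term, score) for term, score in scored if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in kept[:top_n]]


if __name__ == "__main__":
    toy_candidates = {"pain": [1.0, 0.0], "life": [0.8, 0.6], "car": [0.0, 1.0]}
    print(select_key_terms(toy_candidates, [1.0, 0.0]))
```

The toy vectors are invented for illustration; in practice both the candidate vectors and the document vector would come from the same LSA decomposition.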

6.4 Analysis

The initial idea was to compare texts to each generation’s values, expressed as words (e.g. freedom, professionalism), in a semantic space. We expected a text created by a person from Generation X to be semantically closer to the life values of Generation X than to those of Generation Millennium.

Therefore, we created two vectors in the Word2Vec semantic space for Generations X and Millennium, which had been composed from the averaged vectors of each generation’s values expressed as words. Then we vectorized texts for every person in the same Word2Vec space and compared them to the averaged vectors of Generation values.
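
The comparison described above can be sketched as follows. The two-dimensional embeddings, the group labels, and the word lists are invented for illustration and say nothing about the actual values of either generation; averaging a person's word vectors is likewise an assumption about how the text vectors were built.

```python
import math


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    vectors = list(vectors)
    if not vectors:
        raise ValueError("no vectors to average")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed_words(words, embeddings):
    """Average the vectors of those words that exist in the embedding table."""
    return mean_vector(embeddings[w] for w in words if w in embeddings)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


# Toy 2-d embedding table (made up); a real run would use the 500-d model.
embeddings = {
    "freedom": [1.0, 0.0],
    "professionalism": [0.0, 1.0],
    "travel": [0.9, 0.1],
    "career": [0.1, 0.9],
}

# Hypothetical value lists for two unnamed groups.
value_vectors = {
    "group_a": embed_words(["freedom"], embeddings),
    "group_b": embed_words(["professionalism"], embeddings),
}

# Out-of-vocabulary words ("unknown") are simply skipped.
person_vector = embed_words(["travel", "freedom", "unknown"], embeddings)
closest = max(value_vectors, key=lambda g: cosine(person_vector, value_vectors[g]))
print(closest)
```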

The results were clearly heavily lexically biased (i.e. they depended on the form of words rather than their meaning), so we started to lean towards a discourse-based approach, in which texts are compared to other texts instead of abstract life values.

6.5 Experiment

During the experiment phase, 25% of all test subject vectors for every generation were randomly selected. These vectors were then averaged into one, yielding the so-called “generation vector”. In this way, the Generation X and Generation Millennium vectors were established.

The other 75% of individuals from each generation formed the validation sample. Each vector from the validation sample was compared in turn to the Generation X and Generation Millennium vectors. An individual was assigned to a generation if the vector of his or her texts was closer to that generation’s vector than to the other’s.

Overall, 20 iterations were conducted. Each time the generation samples were randomly selected in order to verify the stability of the approach. The following results were obtained:

Iteration                 | Accuracy for Generation X | Accuracy for Generation Millennium
1                         | 1                         | 1
2                         | 1                         | 0.97
3                         | 0.87                      | 0.93
10                        | 0.8                       | 0.987
20                        | 1                         | 0.92
Average for 20 iterations | 0.951                     | 0.953

The experiment results showed that the discourse-based approach to generation classification could be used for the DSSTM system.
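
The evaluation procedure of this section can be sketched as a nearest-generation-vector classifier evaluated over repeated random splits. This is a minimal illustration on made-up data, not the code used in the study: cosine similarity as the closeness measure, the seed, and all function names are assumptions.

```python
import math
import random


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def run_iteration(vectors_by_generation, rng, build_fraction=0.25):
    """One random split: build generation vectors from 25% of each group,
    then classify the remaining 75% by the closest generation vector."""
    generation_vectors, validation = {}, {}
    for generation, vectors in vectors_by_generation.items():
        shuffled = list(vectors)
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * build_fraction))
        generation_vectors[generation] = mean_vector(shuffled[:cut])
        validation[generation] = shuffled[cut:]

    accuracy = {}
    for generation, vectors in validation.items():
        correct = sum(
            1
            for v in vectors
            if max(generation_vectors,
                   key=lambda g: cosine(v, generation_vectors[g])) == generation
        )
        accuracy[generation] = correct / len(vectors) if vectors else float("nan")
    return accuracy


def run_experiment(vectors_by_generation, iterations=20, seed=0):
    """Average per-generation accuracy over several random splits."""
    rng = random.Random(seed)
    results = [run_iteration(vectors_by_generation, rng) for _ in range(iterations)]
    return {
        generation: sum(r[generation] for r in results) / iterations
        for generation in vectors_by_generation
    }


if __name__ == "__main__":
    # Synthetic, well-separated 2-d vectors standing in for person vectors.
    data = {
        "X": [[1.0, 0.1 * (i % 3)] for i in range(20)],
        "Millennium": [[0.1 * (i % 3), 1.0] for i in range(20)],
    }
    print(run_experiment(data))
```

Re-drawing the split on every iteration, as in the experiment, gives a rough picture of how stable the accuracy is with respect to which individuals define the generation vectors.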

6.6 Results and Discussion

The results of the experiment show that the discourse-based approach to generation classification might be utilized as an additional parameter for the DSSTM or other similar systems. This classifier may help HR managers to better understand the life values of individuals who were born in years bordering two generations (e.g. Generations X and Millennium) and may belong to either one of them.

We detected a semantic closeness between texts composed by individuals from one generation, and that closeness was not recognizably influenced by other factors (e.g. place of residence, gender). However, the life values of the generations did not appear to influence the results either.

This model shows promising results, but it should be retested on newly acquired data frequently, due to the fact that individuals write texts on social network web-sites under the influence of various social or political issues. Since these issues tend to change over the course of time, the model built on the basis of current texts will tend to produce worse classification results in future experiments.

Future research on this topic may include testing the obtained model in one or two years in order to verify the assumption that the quality of text classification will be worse.


Cases of data analysis, data mining, mathematical modelling, automated assessment evaluation, as well as other cases of technological involvement in HRM have all shown promising results. Most people possess not only straightforward characteristics, which are easy to estimate, but also abstract and non-linear features. Conclusions drawn from studying one stand-alone individual in advance may not hold for the same individual’s behaviour in different situations and altered social environments. Skill and knowledge management seems to be a simpler target for statistical analysis. Cognitive style and communication specifics present more of a challenge, but analyzing them is still possible. While values are being studied by sociologists, it is still next to impossible to classify different public expressions of human personality traits and to connect them with personal values.

Thus, one possible application of DSSTM and the classification model in government may be the improvement and further automation of the selection of experts for grant proposal evaluation, which is a task of the Ministry of Education and Science. More generally, the Text Mining approach is applicable to governmental institutions dealing with research and development, where government officers and contractors create many unique texts.

In this study, we tried to connect previously known basic conclusions about different generations and their value orientations with the ways in which they may be expressed in an individual’s speech. The experiments showed no obvious correlation of public expressions of opinion and speech peculiarities with generational values.

First of all, opinion expression has its own specific set of features. Not only is it public and permanent (archives may store individual posts infinitely), but it also takes on the role of image modelling (people want to play their preferable roles by constructing a particular social image).

Secondly, it is hard to distinguish the values of a person based on lexically expressed topics alone. Online communication is more reaction-based than creation-based, so people are more likely to express their opinions in reaction to different events or objects. Conversely, individuals are less likely to start their own independent discussions that may seem irrelevant in their current communication environment.

Thirdly, these days it is quite difficult to build data-mining systems with an accurate correlation to values perception. As of now, it seems unlikely that somebody can create a machine learning system that would have the ability to interpret human values.

Even so, this model has produced interesting results. Our research has brought us to the conclusion that if we were to take into consideration the more complicated modeling of the reflection of values in day-to-day human behavior, we could potentially raise the system’s accuracy of interpretation.

Thus, future studies should aim to bring forth the main aspects of complicated terms of human values and to plan the construction of a special environment, which would be less influenced by the purity of human personality expression.


The Ministry of Education and Science of the Russian Federation supported the research reported in this publication. The Unique ID of this research project is RFMEFI57914X0091.
We would also like to acknowledge the hard work and commitment of Alexey Nesterenko throughout the study.


[1] Al-Radaideh, Q., & Nagi, E. (2012). Using data mining techniques to build a classification model for predicting employee performance. International Journal of Advanced Computer Science and Applications, 3(2).

[2] Jantan, H., Hamdan, A., & Othman, Z. (2011). Data Mining Classification Techniques for Human Talent Forecasting. Retrieved from: forecasting.pdf

[3] Kerlinger, F. (1973). Foundations of Behavioral research. New York: Holt, Rinehart and Winston.

[4] Kutuzov, A., Andreev, I. (2015) Texts in, meaning out: neural language models in semantic similarity task for Russian, in Proceedings of the Dialog 2015 Conference. Moscow, Russia

[5] Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (2007). Handbook of Latent Semantic Analysis. Mahwah, NJ: Lawrence Erlbaum Associates.

[6] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. Retrieved from

[7] Myers, D. (2012). Social psychology. New York: McGraw-Hill.

[8] Nikitinsky N., Shashev S., Kachurina P., Bespalov A. (2016). Big Data and Machine Learning in E-government: Automatic Expert Evaluation Case // EEML 2016 – The 3rd International Workshop on Experimental Economics and Machine Learning. — pp. 111-122

[9] Reyan, A. (2016), Psihologiya Lichnosti [Personality Psychology]. Saint Petersburg: Piter.

[10] Sadath, L. (2013). Data mining: A tool for knowledge management in human resources, International Journal of Innovative Technology and Exploring Engineering, 2 (6), 2278-3075.

[11] Shamis, E., Nikonov, E. (2016) Teoriya Pokoleniy: Neobyknoveniy X. [GT: The Incredible X] Moscow: Sinergiya.

[12] Singh, T. (2010). Contemporary trends in talent management: strategizing towards building strong organisation, Abhinav National Monthly Refereed Journal of Research in Commerce & Management, 3 (10).

[13] Sredniy klass v sovremennoy Rossii: 10 let spustya. [Middle class in modern Russia: 10 years later]. Institut. RAS v sotrudn. s. Predstav. Fond. F. Erberta. 2014. Retrieved from lass/full.pdf

[14] Strauss, W. & Howe, N. (1991). Generations. New York: Morrow.


Nikita Nikitinsky
Varshavskoe shosse 47,
building 4, 115230,
Moscow, Russia
Polina Kachurina
ITMO University
Birzhevaya linia 4,
Vasilevskiy Island,
Saint-Petersburg, Russia
Shashev Sergey
Tsentr Razrabotki
20/1 Pozharova Street,
Sevastopol, Russia
Evgeniya Shamis
Sherpa S Pro
Vernadskogo ave., 82/1,
Moscow, Russia
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. EGOSE ’16, November 22 — 23, 2016, St.Petersburg, Russian Federation Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-4859-1/16/11…$15.00 DOI:
