Evaluation of human qualities, such as attitude, competency, proficiency, accomplishment, and belief, among other constructs, is routinely conducted by fairly administering tests that are precisely formulated and applied using standardized protocols. Test takers are usually concerned with the results of a test administered to them; they are not inclined to look into the technical aspects or characteristics of the tool itself. Even so, many people take a test’s internal components into account because they realize that the relevance and usefulness of interpreting test results depend on the test’s core features. In technical terms, these internal test attributes are called psychometric properties.
The psychometric properties of a test are derived from the data gathered through the assessment and indicate how well it evaluates the construct of interest. Developing a valid test is conditional on subjecting it to statistical analyses that ascertain whether it has adequate psychometric properties.
A good psychometric test must have three fundamental properties: reliability, validity, and norming. Be it hiring or developing employees, choosing the correct set of assessments is pivotal and can make or break a business.
Besides the reliability and validity of tests, the standardization of the assessments normed for various aspects, such as age, gender, education, profession, employability, etc. also determines the properties of tests.
Psychometric properties usually provide insights into a test’s meaningfulness, appropriateness, and usefulness (in other words, its validity). Let’s say a test is publicized as a measure important for diagnosing a mental disorder such as bipolar disorder. The psychometric properties of the test present its creators and users with evidence of whether the tool performs as portrayed.
The psychometric property of a test focuses on its particular feature. Some psychometric properties speak volumes about the quality of the whole test, while others give weight to its constituent parts, sections, and even individual items. For instance, when considered in totality, a psychometric property could reveal whether the test assesses a single construct or multiple constructs.
Whether a test analyzes only one dimension or multiple dimensions is a psychometric property of the whole instrument. Another psychometric property of the test could point out whether the instrument evaluates the target construct reasonably well for both men and women. We can call this a psychometric property of gender equality. Yet other psychometric properties furnish evidence of whether a test assesses a construct consistently (reliability).
Psychometric properties are most often expressed quantitatively. Numerical quantities such as a coefficient or an index are used to represent the property. For example, the reliability coefficient is a numerical value that most students and professionals are familiar with. Even though reliability is described as a psychometric feature of a test, it can be expressed as a quantitative value.
Likewise, many other psychometric properties are expressed numerically. However, a single number is not always the best way to convey a specific psychometric property. For example, validity cannot be meaningfully reduced to a single value or index. It is an encompassing psychometric property, and explaining test validity requires an exhaustive discussion that synthesizes a substantial body of evidence.
One must explore and learn about the psychometric properties of tests for two key reasons. First, this knowledge enables makers to create useful tests. Psychometricians and other experts who create tests must analyze and describe the functionality of tests to build them to a predefined level of quality. Second, the awareness about the psychometric properties of a test ensures that the information gained using the instrument could provide a firm foundation for making the right decisions. It stands to reason that counselors, psychologists, policy personnel, educators, and many other professionals often formulate their decisions on the data collected from the tests.
Psychometric properties: Reliability, Validity, and Norming
By definition, a standardized test is administered and scored in a consistent, or “standard”, manner. Such tests are designed so that the questions, administration conditions, scoring procedures, and interpretations remain consistent.
Standardized testing could be composed of true-false, multiple-choice, authentic assessments, or essays. It’s possible to shape any form of assessment into standardized tests. When it comes to the creation of psychometric evaluations, questions are measured in scales. And these too are often most valid with standardization post-creation.
We should look for these three factors when creating/standardizing psychometric tests:
Psychometric test reliability refers to the degree to which test scores are consistent and free of measurement error. In other words, does the test yield dependable scores each time it is used? It is the consistency with which the measurement tool produces scores from which interpretations can be made.
For instance, a test measuring intelligence should yield the same score for the same person each time he or she completes the test, with only a short period in between (provided the test taker’s intelligence has not changed over that period).
A test is reliable as long as it produces similar results over time, repeated administration, or under similar circumstances.
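As a rough illustration of the idea above, consistency across two administrations is commonly summarized with a correlation between the two sets of scores. The sketch below assumes a Pearson correlation as the coefficient; the function name `pearson_r` and the score lists are hypothetical and made up for the example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same five people on two sittings
first_sitting = [98, 105, 110, 121, 130]
second_sitting = [100, 103, 112, 119, 131]

retest_reliability = pearson_r(first_sitting, second_sitting)
print(f"test-retest correlation: {retest_reliability:.3f}")
```

A value close to 1 suggests that people keep roughly the same rank order across sittings; how high is "high enough" depends on the purpose of the test.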
If you were to use a professional dart player as an example, his or her ability to hit the same spot on the board consistently under specified conditions, even if that spot is not the bullseye, would classify him or her as a reliable dart player. However, this falls short on validity. Applied to psychometric assessments, a reliable test is one that produces stable results over time.
Over the years, scholars and researchers uncovered multiple ways to check for reliability. Some include testing the same participants at different points of time or presenting the participants with varying versions of the same test to see how consistent the results are.
Suffice it to say that an assessment has to show demonstrably excellent reliability to qualify for validity.
In alternate (parallel) forms reliability, two different tests use the same content but separate procedures or equipment, and yield the same results for each test taker.
In internal consistency, items within the test are examined to see whether they appear to measure what the test measures. Internal reliability between test items is referred to as internal consistency.
When two raters score the psychometric test in the same way, inter-scorer consistency is high.
Test-retest reliability is when the same test is conducted over time, and the test taker displays consistency in scores over multiple administrations of the same test.
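Internal consistency, mentioned above, is often summarized with Cronbach's alpha. The article does not name a specific coefficient, so treat the following as one common choice rather than the method used here; the function name and the response matrix are hypothetical.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of item scores per test taker)."""
    k = len(responses[0])                # number of items
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 test takers")
    items = list(zip(*responses))        # transpose: one tuple of scores per item
    sum_item_vars = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Hypothetical answers of five people to four items on a 1-5 scale
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 3],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Alpha rises when items vary together across test takers, which is the sense in which the items "measure what the test measures".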
There are always minor discrepancies in psychometric test reliability. Moreover, individuals taking the same psychometric test may have different thoughts, feelings, or ideas at various points in time, leading to variance in scores. A lot of factors (both stable traits and momentary issues) can result in variation in test scores.
Stable traits include weight, height, and other such characteristics. Momentary inconsistency is attributed to different things such as the health of test-takers, an understanding of a particular test item, and so forth.
Reliability is essential for psychometric tests. After all, what is the point of having the same test yield different results each time, especially if scores can affect employee selection, retention, and promotion?
Psychometricians identify two different categories of errors:
Numerous factors influence test reliability. The timing between two test sessions affects test-retest and alternate/parallel forms reliability. The similarity of content and expectations of subjects regarding different elements of testing affects only the latter type of reliability along with split half and internal consistency.
Changes in subjects over time, such as their environment, physical state, emotional and mental well-being, also need to be considered while assessing the reliability of psychometric tests. Test-based factors such as inadequate testing instructions, biased scoring lacking in objectivity, and guessing on the part of the test-taker also influence the reliability of tests. Tests can generate reliable estimates sometimes and not so stable results other times (Geisinger, 2013).
So, just how reliable is your test? Well, it all depends on these factors:
Test designers construct questions on the psychometric test to assess a mental quality (for example, motivation). The difficulty level of the test questions, or the confusion they create through ambiguity, can negatively influence reliability. Biases in interpreting the items as well as errors in question construction can only be corrected if test instructions are properly implemented, and the redesign and research process is active and ongoing.
Administration of the test is another area where systemic errors can creep in. Instructions accompanying the analysis should be clear cut and well defined. Errors in the guidance provided to test takers or administrators can have multiple adverse effects on the reliability of the test. Guidelines that affect accurate interpretation could lower test reliability.
Reliability also means that the test has a particular scoring system, by which interpretation of results is possible. All tests comprise instructions on scoring. Errors such as conclusions without basis or substantial proof can lower the reliability of the test. Test construction is associated with research to provide evidence for the conclusions drawn. If there is a systematic error in the test design phase, this can impact reliability too.
Excessive extremes in temperature or distractions of an audio-visual nature can influence test scores regarding reliability. Errors made in administering the psychometric test can also impact the reliability of scores obtained. Human error is possible too, and interpretation or scoring can be influenced by the examiner’s attitude towards the test taker.
The person being examined may suffer from social desirability concerns and give answers that are not reflective of actual choices. Other factors that influence test takers include anxiety, bias, physical factors like illness, or lack of sleep.
Increasing test length can be a way to improve reliability. The longer the test, the more reliable it is considered.
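The effect of test length is usually estimated with the Spearman-Brown prophecy formula, which is not named in the text above and is brought in here as an assumption. It also assumes that the added items are comparable in quality to the existing ones. The function name and the starting reliability of 0.70 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after multiplying test length by `length_factor`.

    Assumes the added (or removed) items are parallel to the existing ones.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

current = 0.70  # hypothetical reliability of the current test
for factor in (0.5, 1, 2, 3):
    predicted = spearman_brown(current, factor)
    print(f"length x{factor}: predicted reliability {predicted:.3f}")
```

Under these assumptions the gains shrink as the test grows, so lengthening helps most when the starting test is short.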
Speed versus power in psychometric testing is an age-old debate. Speed tests are designed so that not every test taker can complete all the items in the time allowed. Power tests provide items of average difficulty and ensure that test takers have ample opportunity to complete the psychometric test. Test takers can be evaluated reliably if a test has items that can be completed. The reliability of speed tests should not be estimated with internal consistency methods based on a single administration, which tend to overstate it; test-retest or parallel forms methods are more appropriate.
This is another factor whereby the more heterogeneous the scores of the test, the more reliable its measures will be.
When there is low variability among test scores, reliability decreases. If the test is so easy that every test taker can easily complete it, how will it serve as a measure of individual differences?
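One way to see why low variability hurts is the classical test theory view of reliability as the share of observed score variance that comes from true differences between people. That framing is an assumption added here, and the variance figures below are invented for illustration.

```python
def reliability_ratio(true_variance, error_variance):
    """Reliability as true-score variance divided by observed-score variance."""
    if true_variance < 0 or error_variance < 0:
        raise ValueError("variances cannot be negative")
    observed_variance = true_variance + error_variance
    if observed_variance == 0:
        raise ValueError("observed variance is zero; reliability is undefined")
    return true_variance / observed_variance

error_variance = 9.0  # hypothetical measurement noise, same in both groups

varied_group = reliability_ratio(36.0, error_variance)   # people differ a lot
uniform_group = reliability_ratio(4.0, error_variance)   # people score alike

print(f"heterogeneous group: {varied_group:.2f}")
print(f"homogeneous group:   {uniform_group:.2f}")
```

With the same amount of noise, the group whose members barely differ yields a much lower ratio, which matches the point that a test everyone passes easily says little about individual differences.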
Validity is qualitatively defined as the test’s efficacy to measure what it claims to measure. Suffice to say, a test with high validity ensures the test items (questions) remain closely linked with the test’s intended focus.
It is understandable to expect a test used in organizations to shed light on how a candidate would perform in a particular job. With this in mind, it is essential to reiterate the difference between reliability and validity, with the former being a prerequisite to the latter.
Let’s consider the same dart player. In repeated trials, he or she continues to miss the mark consistently by about two inches. Of course, this implies a reliable aim: each shot hits the board in a region two inches from the target. Yet it is difficult not to question the player’s validity as a professional in comparison to his or her peers, considering that he or she doesn’t hit the bull’s eye, which is the aim of all professional dart players.
Reliability and validity go hand in hand, but reliability by no means indicates validity. As our example showed, having the first without the second hints at high consistency, but also inaccurate consistency.
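The dart example can be put into numbers with a small sketch. Here "spread" (how tightly the throws cluster) stands in for reliability, and "bias" (how far the cluster sits from the bullseye) stands in for validity. These two measures, the function name, and the coordinates are all illustrative assumptions, not a formal validity statistic.

```python
from math import hypot

def bias_and_spread(throws, target=(0.0, 0.0)):
    """Return (bias, spread) in inches for a list of (x, y) landing points.

    bias   = distance from the average landing point to the target
    spread = average distance of the throws from their own average point
    """
    n = len(throws)
    center_x = sum(x for x, _ in throws) / n
    center_y = sum(y for _, y in throws) / n
    bias = hypot(center_x - target[0], center_y - target[1])
    spread = sum(hypot(x - center_x, y - center_y) for x, y in throws) / n
    return bias, spread

# Hypothetical throws: tightly grouped, but about two inches right of the bullseye
consistent_but_off = [(2.0, 0.1), (2.1, -0.1), (1.9, 0.0), (2.0, 0.0)]

bias, spread = bias_and_spread(consistent_but_off)
print(f"bias: {bias:.2f} in, spread: {spread:.2f} in")
```

A small spread with a large bias is the "consistent but inaccurate" case described above: reliable, yet not valid.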
There are tests for validity
Even with a test that is both reliable and valid, there exists a question about results. An assessment fails without quantifiable results, but as often stated – human beings are far from measurable.
Psychometric tests are often normed against groups for comparison. Norming avoids looking at individual items or questions and instead compares an individual’s total score with the scores of a representative sample.
A representative sample means using a group of children when developing a test for children and an adult group when developing a test for adults. Also, based on the population, samples are generally made representative based on demographic factors like age, gender, education, religion, etc.
This is primarily a standard practice because a psychometric test score of say 30 correct out of 40 is meaningless unless compared to the performance of others at a similar level on the same test. The practice of using relative scores becomes all the more important when interpreting ability test results.
When you score at the 94th percentile on a trait like extraversion, it means that you are more extraverted than 94% of the sample group from whom the test makers derived the normal distribution. On the other hand, if you scored 94% on a math test, it implies that you answered about 94 in every 100 questions correctly.
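The difference between a percentage and a percentile can be shown with the earlier example of 30 correct out of 40. The norm group scores below are invented, and the percentile rank here counts only people scoring strictly below the candidate; other definitions that handle ties differently are also in use.

```python
def percent_correct(correct, total):
    """Share of questions answered correctly, as a percentage."""
    return 100.0 * correct / total

def percentile_rank(score, norm_scores):
    """Percentage of the norm group that scored strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)

# Hypothetical raw scores (out of 40) from a ten-person norm group
norm_group = [18, 20, 22, 24, 25, 26, 27, 28, 29, 33]

raw_score = 30
pct = percent_correct(raw_score, 40)
rank = percentile_rank(raw_score, norm_group)
print(f"{pct:.0f}% correct, but ahead of {rank:.0f}% of the norm group")
```

The same raw score would land at a very different percentile against a stronger or weaker norm group, which is why the choice of comparison sample matters.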
It’s important to note, however, that every test has its appropriate norm group. Results are also better interpreted when the psychometric data is considered within the context of the role. For example, if the role involves numerical work but no real-world time pressure, someone with below-average results on numerical reasoning tests may be given the benefit of the doubt.
Where possible, it also makes sense to take the candidate’s response style into account when interpreting percentile scores. Response style has to do with both speed and accuracy: some people prefer a slower approach to ability tests, which are a part of psychometrics, with an emphasis on precision. Others cover more ground across items with lower accuracy.
Psychological constructs such as personality have no right or wrong answers associated with them, and can thereby not be marked using percentages. This is why academics and researchers alike resort to norming, among other methods, to make sense of scores on personality assessments.
With growing concerns over costs, convenience, and other logistical challenges, technology-enabled assessments have become popular over time as well, simply because they streamline the process, reduce costs, increase efficiency, and allow employers to assess and analyze more data points than was previously deemed possible.
Know-how about the creation or standardization of psychometric tests aside, it’s also imperative to understand how best one can determine the quality of a psychometric test.
Mettl is an online platform that enables both recruiters and companies to measure candidates’ abilities in various domains, helping them make well-informed decisions of recruitment, training, and development of candidates/employees. Access our extensive library of tests and simulators for defining the qualities of top talent by evaluating their underlying abilities, knowledge, skills, and behavioral attributes. Browse through a wide range of psychometric, cognitive, role-centric, and technical assessments to get your people decisions right. After all, it’s the humans, not machines that build successful businesses.
Primarily, it’s the people who build effective businesses, not the tools, data, or technology. They decide whether or not the business will run constructively. Having the right set of people at work is crucial for every organization. Mettl’s psychometric assessments and tools cater to every organization’s unique needs. These scientific and data-backed tools are useful for measuring human personality and predicting behavior.
Originally published April 12 2018, Updated August 4 2020