A recent report by the Coalition Against Insurance Fraud suggests that insurers are pursuing predictive analytics and other fraud detection technology with added vigor. In fact, 95% of survey respondents said that they use anti-fraud technology, an increase from 88% in 2012. But insurers cited data integration and poor data quality as major challenges in implementing anti-fraud technology. Many projects get off track before they even get started because of data access and data quality problems.
Data, data everywhere and not a drop to analyze
Insurers are brimming with data. One of the biggest challenges insurance companies face when implementing predictive analytics is gaining access to the right data sources. Efforts are hampered by multiple legacy claim applications, systems that are segregated by line of business, disparate homegrown SIU case management systems, and third-party vendor systems that house critical data like medical billing information. Much of the useful information is contained in lengthy unstructured text fields like claim notes. Hand-keyed data fields often contain spelling and transposition errors. Frequently, there is no universal key identifying individuals and entities across claims, policy, and billing systems. Of course, we are adding new high-volume sources every day, such as feeds from social media and telematics devices.
While consolidating this data can be complex, narrowing the scope of the effort and using robust data integration and data quality tools can make this happen much faster and more cost-effectively. Addressing data quality and integration issues is critical to producing a successful model.
Garbage in, garbage out
Some fraud detection systems do not account for these data quality issues. A system deployed without addressing them up front may still provide some value, but users will forever be plagued by false positives, missed opportunities, and erroneous flags. The quality of fraud analytics depends directly on the quality of the input data. Here are four key steps to prepare insurance data for fraud analytics:
1. Integrate data silos. Core processing systems are designed to serve a specific purpose that often has nothing to do with aggregated data analysis. Even if they originate in different places, claims, policy, application, billing, and medical data sources need to be joined together for purposes of fraud analytics. Don’t forget to include legacy systems and other less formal “systems” that might be used during everyday business processes, like spreadsheets, watch lists, case-management applications, and shared file systems.
During this stage, it's critical to document the integration efforts and ensure that they are repeatable and auditable. This becomes essential when you enable fraud analytics scoring in production.
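As a minimal sketch of this kind of repeatable, auditable join, the snippet below merges hypothetical claims and policy extracts with pandas. The table names, column names, and values are all illustrative, not from any real system; the key point is the left join (so no claim is silently dropped) and the `indicator` flag (so unmatched rows can be audited).

```python
import pandas as pd

# Hypothetical extracts from two silos; columns and values are illustrative.
claims = pd.DataFrame({
    "claim_id": ["C100", "C101", "C102"],
    "policy_no": ["P1", "P2", "P1"],
    "loss_amount": [5000, 1200, 800],
})
policies = pd.DataFrame({
    "policy_no": ["P1", "P2"],
    "line_of_business": ["Auto", "Home"],
})

# Left-join so every claim survives even if its policy record is missing;
# indicator=True adds a _merge column that makes unmatched rows auditable.
merged = claims.merge(policies, on="policy_no", how="left", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
```

Logging `unmatched` (rather than discarding it) is what makes the integration step auditable when the same pipeline later runs in production scoring.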
2. Deal with missing and erroneous data. Do your systems contain individuals whose Social Security number is 123-45-6789 or 999-99-9999? Have you encountered a claim file with a missing telephone number? If these errors are ignored, they can have a dramatic negative impact on fraud analytic results. Leading data quality tools can help identify, repair, and replace missing or erroneous data.
In some cases, the missing data is available in another system, while in other instances it can be inferred based on a combination of other sources. During this stage, it is also useful to standardize formats for common fields like addresses.
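A small illustration of catching placeholder and malformed values: the function below normalizes an SSN-like field and rejects known dummy values. The placeholder list and the helper name are assumptions for the example; a real deployment would rely on a data quality tool's validation rules.

```python
import re

# Common dummy values keyed in when the real SSN is unknown (illustrative list).
PLACEHOLDER_SSNS = {"123-45-6789", "999-99-9999", "000-00-0000"}

def clean_ssn(raw):
    """Return a normalized SSN string, or None if missing, malformed,
    or a known placeholder value."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", str(raw))  # strip dashes, spaces, etc.
    if len(digits) != 9:
        return None
    ssn = f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"
    return None if ssn in PLACEHOLDER_SSNS else ssn
```

Returning `None` instead of the dummy value lets a downstream step decide whether to look the field up in another system or infer it from other sources, as described above.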
3. Resolve entities. Once data is aggregated from multiple systems, it is important to identify whether the same individuals, companies, or other entities exist in multiple places. One system might capture name and Social Security number, while another system uses name and date of birth. Entity resolution techniques can be used to link these two records and identify them as the same individual. Entity resolution can involve simple rules, but the best results are seen when more advanced analytical techniques are used to determine the likelihood of matching.
This step is especially important if social network analytics or link analysis will be used in the fraud analytics solution. A single person could appear as an insured, claimant, witness, driver, vendor, and employee across multiple claims. Being able to link those roles is powerful in detecting suspicious activity.
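The rule-plus-score idea can be sketched with a toy matching function. The weights and field names here are illustrative assumptions, not a tuned model; production entity resolution would use more advanced probabilistic matching, as the text notes.

```python
from difflib import SequenceMatcher

def match_score(rec_a, rec_b):
    """Rough likelihood in [0, 1] that two records refer to the same person.
    Weights are illustrative, not tuned."""
    name_sim = SequenceMatcher(
        None, rec_a["name"].lower(), rec_b["name"].lower()
    ).ratio()
    score = 0.5 * name_sim
    # Exact matches on strong identifiers add weight when both sides have them.
    if rec_a.get("ssn") and rec_a.get("ssn") == rec_b.get("ssn"):
        score += 0.3
    if rec_a.get("dob") and rec_a.get("dob") == rec_b.get("dob"):
        score += 0.2
    return score

# One system captured name + SSN, the other name + date of birth.
a = {"name": "John Q. Smith", "ssn": "078-05-1120"}
b = {"name": "Smith, John", "dob": "1980-04-02"}
```

Records scoring above a chosen threshold would be linked under one entity key, giving link analysis a single node per person instead of several disconnected ones.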
4. Process unstructured text. By some estimates, up to 80% of insurer data is kept in text format. Some of the best information about a claim file is captured in the loss description or claim notes fields. But dealing with free-form text isn't simple. Abbreviations, acronyms, industry jargon, and misspellings are commonplace and need to be addressed by a text analytics solution that contains a library of terminology specially designed for insurance data.
During the analysis of text, additional model variables can also be created. This is a very powerful way to expand the scope of fraud analytics without having to include external data sources. Machine learning and natural-language processing techniques should be used to find and create useful variables for fraud analytic modeling.
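To make the idea of deriving model variables from notes concrete, here is a deliberately tiny sketch: expand a few abbreviations, then emit boolean flag variables for cue phrases. The abbreviation map and cue list are invented for illustration; a real solution ships an insurance-specific terminology library and uses machine learning rather than a hand-built lookup.

```python
import re

# Tiny illustrative abbreviation map (a real library is far larger).
ABBREVIATIONS = {"clmt": "claimant", "atty": "attorney", "ins": "insured"}

# Illustrative cue phrases that become model variables.
FRAUD_CUE_TERMS = {"attorney", "prior claim", "no police report"}

def normalize_note(note):
    """Lowercase the note and expand known abbreviations token by token."""
    tokens = re.findall(r"[a-z']+", note.lower())
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def note_features(note):
    """Return one boolean flag variable per cue phrase."""
    text = normalize_note(note)
    return {f"has_{cue.replace(' ', '_')}": cue in text
            for cue in FRAUD_CUE_TERMS}

features = note_features("Clmt retained atty day after loss; no police report filed.")
```

Each flag becomes an input variable for the fraud model, expanding its scope without pulling in any external data source.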
Effective data management is essential to every fraud analytics implementation. The investment made in the data cleansing process up front will pay significant dividends in the form of improved detection rates.
James Ruotolo is an insurance fraud technologist, thought leader and the principal for insurance fraud solutions at SAS. Connect with him on Twitter @jdruotolo.