Leveraging Data Science to Identify Fraudulent Scientific Studies


Jeremy Clopton, CFE, CPA, ACDA
Owner, What's Your SQ

A recent article in The Atlantic discussed how a scientific researcher is leveraging new software to address the issue of fraud in scientific studies. Though typically I focus more on the application of analytics for occupational fraud, this article showcases how unstructured data (data that doesn’t fit nicely into a row/column or tabular format) can be used to identify red flags of fraud. 

To combat the fraud issue in research, the researcher is developing software that identifies nucleotide sequences within a published body of research. From there, it runs each sequence against a known database to verify that the sequence acts as claimed in the research. Identifying text similarities and differences is key to this type of success.

Though it may sound quite advanced, the underlying concept is present in many of the commonly used analytics packages on the market. The concept is the “regular expression,” or “regex.” A regex is a sequence of characters that defines a pattern used to find matching text strings in a larger body of text. In this situation, the software developer is using the pattern of a nucleotide sequence to identify where such sequences appear in the research. From there, the context is compared to that contained in the master data to determine if there is a discrepancy.
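To make the idea concrete, here is a minimal Python sketch of the two steps described above: find sequence-like strings with a regex, then compare them to master data. It is not the researcher's actual software. The pattern (a run of 15 or more A, C, G or T letters), the KNOWN_TARGETS lookup table and the gene names are all assumptions made for illustration.

```python
import re

# Assumption: a nucleotide sequence appears in text as an unbroken run
# of at least 15 of the letters A, C, G and T.
NUCLEOTIDE_PATTERN = re.compile(r"\b[ACGT]{15,}\b")

# Hypothetical master data: sequence -> the gene it is known to target.
KNOWN_TARGETS = {
    "ACGTACGTACGTACGTACGT": "GENE_A",
}


def find_sequences(text):
    """Return every sequence-like string found in the text."""
    return NUCLEOTIDE_PATTERN.findall(text)


def flag_discrepancies(text, claimed_target):
    """Return (sequence, claimed, actual) for each known sequence whose
    recorded target differs from the one claimed in the text."""
    flags = []
    for seq in find_sequences(text):
        actual = KNOWN_TARGETS.get(seq)
        if actual is not None and actual != claimed_target:
            flags.append((seq, claimed_target, actual))
    return flags


sample = "We used the primer ACGTACGTACGTACGTACGT to target GENE_B."
print(flag_discrepancies(sample, "GENE_B"))
```

Sequences that are absent from the lookup table are skipped rather than flagged, since a missing entry says nothing about whether the claim is wrong.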

Application to fraud examiners
So, how does this apply to fraud examiners? This baseline text analytics concept can be used to identify instances where confidential data (such as Social Security numbers or credit card numbers) is being sent outside the organization via email or stored in an unsecured area of the organization. Using regular expressions allows an examiner to identify patterns in communication data that may indicate sensitive data is being shared. With increased risks of intellectual property theft, data sharing and other data-related risks, monitoring communication for the presence of sensitive data is an effective means of being proactive.
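As a rough illustration of that kind of test, the Python sketch below scans the text of a message for strings shaped like Social Security numbers or 16-digit card numbers. The two patterns and the scan_message helper are assumptions chosen for simplicity, not a complete detection rule set. They match on format only, so any hit is a red flag for review rather than proof that sensitive data was shared.

```python
import re

# Assumption: SSNs are written as 3-2-4 digits separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Assumption: card numbers are 16 digits, optionally grouped in fours
# with spaces or hyphens between the groups.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")


def scan_message(body):
    """Return (label, matched_text) pairs for each pattern hit in the body."""
    hits = []
    for label, pattern in (("ssn", SSN_PATTERN), ("card", CARD_PATTERN)):
        for match in pattern.finditer(body):
            hits.append((label, match.group()))
    return hits


email_body = "My SSN is 123-45-6789 and card 4111 1111 1111 1111."
for label, text in scan_message(email_body):
    print(label, text)
```

In practice an examiner would run a function like this over exported email bodies or file contents and then review the flagged messages manually, because format-only patterns also match harmless strings such as order or tracking numbers.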

While much of the focus in the use of unstructured data is on the more advanced applications of technology, such as artificial intelligence, it is important not to overlook some of the more basic uses of the data. Employees who misappropriate data are a risk to the organization, and many use email as the initial means of removing that data. Proactive testing using regex functions is a great way to get started identifying these risks.

Hear Jeremy speak about data trends in his Pre-Conference session, "Next-Generation Fraud Examinations: Leveraging Emerging Tech and Advanced Analytics" at the 29th Annual ACFE Global Fraud Conference, June 17-22 in Las Vegas. Register by May 11 to save $100.