We are using machine learning to understand structured human data, including language, document images, music, and other complex artifacts.
This research combines natural language processing and machine learning, focusing on unsupervised methods for deciphering hidden structure. Specific applications of these unsupervised methods have included document summarization, author attribution, historical document transcription, piano music transcription, and sensor-based characterization of behavior in small animals.
Case Study: Automated Analysis of Cybercriminal Markets
Underground forums are widely used by criminals to buy and sell a host of stolen items, datasets, resources, and criminal services. These forums contain important resources for understanding cybercrime. However, the number of forums, their size, and the domain expertise required to understand the markets make manual exploration of these forums unscalable. In this work, we propose an automated, top-down approach for analyzing underground forums. Our approach uses natural language processing and machine learning to automatically generate high-level information about underground forums, first identifying posts related to transactions, and then extracting products and prices. We also demonstrate, via a pair of case studies, how an analyst can use these automated approaches to investigate other categories of products and transactions. We use eight distinct forums to assess our tools: Antichat, Blackhat World, Carders, Darkode, Hack Forums, Hell, L33tCrew, and Nulled. Our automated approach is fast and accurate, achieving over 80% accuracy in detecting post category, product, and prices.
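To make the two-stage flow concrete (first flag transaction-related posts, then pull out prices), here is a deliberately simplified Python sketch. It is not the system described in the paper, which uses trained machine-learning models. The cue-word lists, the price pattern, and the function names `classify_post`, `extract_prices`, and `analyze_post` are all hypothetical stand-ins chosen for illustration, and product extraction is omitted entirely.

```python
import re

# Hypothetical cue words (an assumption for illustration); a real system
# would learn such signals from annotated forum posts.
SELL_CUES = {"selling", "sell", "wts", "sale"}
BUY_CUES = {"buying", "buy", "wtb", "looking"}

# Matches amounts such as "$25", "$1,200.50", "30 USD", or "0.5 btc".
PRICE_RE = re.compile(
    r"\$\s?(\d[\d,]*(?:\.\d+)?)|(\d[\d,]*(?:\.\d+)?)\s?(usd|btc|eur)\b",
    re.IGNORECASE,
)


def classify_post(text):
    """Stage 1: label a post as 'sell', 'buy', or 'other' by cue-word counts."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    sell = len(tokens & SELL_CUES)
    buy = len(tokens & BUY_CUES)
    if sell == 0 and buy == 0:
        return "other"
    return "sell" if sell >= buy else "buy"


def extract_prices(text):
    """Stage 2: return (amount, currency) pairs found in the text."""
    prices = []
    for match in PRICE_RE.finditer(text):
        if match.group(1) is not None:
            # "$" form: assume US dollars.
            prices.append((float(match.group(1).replace(",", "")), "USD"))
        else:
            # "<amount> <currency code>" form.
            amount = float(match.group(2).replace(",", ""))
            prices.append((amount, match.group(3).upper()))
    return prices


def analyze_post(text):
    """Run both stages; prices are only extracted from transaction posts."""
    category = classify_post(text)
    prices = extract_prices(text) if category != "other" else []
    return {"category": category, "prices": prices}
```

Calling `analyze_post` on a post's text returns its category and any extracted prices. A rule-based version like this would be brittle on real forum text (slang, misspellings, multiple languages), which is part of the motivation for learned models.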
For More Information:
Rebecca S. Portnoff, Sadia Afroz, Greg Durrett, Jonathan K. Kummerfeld, Taylor Berg-Kirkpatrick, Damon McCoy, Kirill Levchenko, and Vern Paxson. 2017. Tools for Automated Analysis of Cybercriminal Markets. In Proceedings of the 26th International Conference on World Wide Web (WWW '17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 657-666.
Example post and annotations from Darkode, with one sentence per line. We underline our annotations of both the core product (mod DCIBot) and the method for obtaining that product (sombody).