If you follow me on Twitter, you may already know this. If you don’t (I assure you that you’re missing out), I have been researching several technologies in preparation for an OPSEC/Anti-OSINT tool that I am building. I am using this tool to push myself to learn something new that I can apply professionally, and to make a positive difference in the world. Specifically, I am trying to learn Machine Learning (ML) and Natural Language Processing (NLP) in Python and R.
When we hear terms like Advanced Persistent, Next-Generation, Artificial Intelligence (AI), Machine Learning (ML), Single Pane of Glass, etc. from a vendor, we typically think it’s hype or FUD. Often, we are correct. Talking about vendor FUD phrases is ironic because my blog and podcast were called Advanced Persistent Security. I set off on this journey to learn enough to build a tool, but also to understand the technologies themselves. I like to stump salespeople from time to time. Also, if these are the wave of the future, there is no time like the present to get acquainted.
So, NLP. What is it? In social engineering circles, it is Neuro-Linguistic Programming. Some (many, if not most) in the scientific community consider it pseudoscience. Regardless, it claims to be able to influence or manipulate people through non-verbal cues from the eyes or touching someone (cringe) or other means. That is not the NLP that I am working on learning.
Natural Language Processing, the more scientific NLP, is a marriage of various disciplines: computer science, data science (including AI and ML), and linguistics. NLP allows libraries and code to read language as it is written or spoken by humans (naturally, hence the name). When exposed to slang, pidgins, and dialects, it can “learn” to recognize and respond to them.
Adjacent to NLP is OCR, or Optical Character Recognition. OCR is the means to read data from a document in a non-text format (e.g., pictures, PDFs, or Word documents). Having the ability to read the data allows you to open a PDF with a script (perhaps written in Python), read it, make sense of it, and act as scripted.
Why is this important to InfoSec, and what do we do with it? We could use this in log analysis, network monitoring, analyzing phishing emails, and my personal favorite, OSINT, to name a few. Within log analysis, NLP could be applied to gain further intelligence from logs without writing ridiculously long regular expressions (regex), by “learning” the context of the data and what is being sought.
This would likely be in parallel with some Machine Learning, but it is a start. From the ML perspective, it would probably need to utilize supervised or semi-supervised learning with online learning, as opposed to unsupervised or reinforcement learning. Online means that it would read the data closer to real time rather than ingesting a predefined dataset. The supervision of learning refers to telling the “machine” whether it was correct or not. In some instances of learning from logs, unsupervised learning could be useful in determining indicators of compromise or adversarial TTPs based on log data in two sets: breached (event data) and non-breached data. Reinforcement learning would be more applicable for tuning and improvement.
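To make the "supervised" and "online" ideas concrete, here is a minimal sketch in plain Python. It is not the tool I am building. It assumes a tiny bag-of-words perceptron, and the class name, the sample log lines, and the 1 = suspicious / 0 = benign labels are all made up for illustration. The model sees one labeled line at a time and is only corrected when it guesses wrong.

```python
import re


def tokenize(line):
    """Lowercase a log line and keep only alphabetic tokens (IPs and ports are dropped)."""
    return re.findall(r"[a-z]+", line.lower())


class OnlinePerceptron:
    """Hypothetical bag-of-words perceptron that learns one labeled line at a time."""

    def __init__(self):
        self.weights = {}
        self.bias = 0.0

    def score(self, line):
        return self.bias + sum(self.weights.get(t, 0.0) for t in tokenize(line))

    def predict(self, line):
        # 1 = suspicious, 0 = benign (labels are an assumption for this sketch)
        return 1 if self.score(line) > 0 else 0

    def learn(self, line, label):
        """Supervised, online step: predict first, then adjust weights only on a mistake."""
        error = label - self.predict(line)
        if error != 0:
            for token in tokenize(line):
                self.weights[token] = self.weights.get(token, 0.0) + error
            self.bias += error
        return error == 0


# Made-up stream of (log line, label) pairs, consumed once in arrival order.
stream = [
    ("Failed password for root from 203.0.113.5 port 22", 1),
    ("Accepted publickey for deploy from 198.51.100.7 port 22", 0),
    ("Failed password for invalid user admin from 203.0.113.9", 1),
    ("Accepted password for alice from 198.51.100.20 port 22", 0),
]

model = OnlinePerceptron()
for line, label in stream:
    model.learn(line, label)

print(model.predict("Failed password for bob from 192.0.2.1"))
print(model.predict("Accepted publickey for carol from 192.0.2.2 port 22"))
```

A real deployment would need far more data and a proper library, but the shape is the same: predict, get told whether you were right, adjust, repeat.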
Back to NLP, the same concepts apply to network monitoring as to log analysis, except it would be network traffic and PCAPs being analyzed. PCAP analysis with NLP and ML may be better suited for analyzing a user’s behavior and attempting to identify when their accounts have been taken over or for insider threat predictions. However, I have reservations as to the Orwellian nature of the latter.
For phishing email analysis, the NLP portion could be used to build a large data set, a corpus, and attribute a phish to the exploit kit or threat actor behind it. Such analysis could also help thwart business email compromise beyond technical controls like SPF, DKIM, and DMARC.
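As a rough illustration of comparing a new phish against a corpus, the sketch below uses bag-of-words cosine similarity in plain Python. This is a simple stand-in technique that I picked for the example, not a claim about how attribution is done in practice. The campaign labels and lure texts are invented, and `closest_campaign` is a hypothetical helper.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens in a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented corpus: campaign label -> one sample lure. A real corpus would hold many samples.
corpus = {
    "invoice-lure": "Please review the attached invoice and remit payment today",
    "password-reset-lure": "Your mailbox password expires today, click here to verify your account",
}


def closest_campaign(email_text, corpus):
    """Return the (label, score) of the corpus entry most similar to the email."""
    query = bag_of_words(email_text)
    scored = {
        name: cosine_similarity(query, bag_of_words(sample))
        for name, sample in corpus.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]


name, score = closest_campaign(
    "Urgent: your password expires soon, verify your account here", corpus
)
print(name, round(score, 2))
```

Word counts alone ignore word order and meaning, so treat this as the floor of what NLP could do here, not the ceiling.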
In OSINT, it could be combined with aspects of data mining to read a target’s websites and employee resources to determine how an organization operates, or to identify critical terms like “Cast Member” or “Associate” as terms for employees in the cases of Disney and Walmart, respectively. Depending on what is sought, it could help investigators find what they are looking for using context as opposed to just keywords.
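One simple way to surface organization-specific vocabulary is to compare how often words and two-word phrases appear in a target’s text versus a generic baseline. The sketch below does that in plain Python. It is frequency-based rather than truly context-aware, the two text samples are invented, and `distinctive_terms` is a hypothetical helper for illustration only.

```python
import re
from collections import Counter


def terms(text):
    """Count single words and adjacent two-word phrases in lowercase text."""
    words = re.findall(r"[a-z]+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return Counter(words + bigrams)


def distinctive_terms(target_text, baseline_text, top_n=3):
    """Rank target terms by how much more often they appear than in the baseline."""
    target = terms(target_text)
    baseline = terms(baseline_text)
    # Add-one smoothing so terms missing from the baseline do not divide by zero.
    scores = {t: count / (baseline[t] + 1) for t, count in target.items()}
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


# Invented samples standing in for scraped target pages and generic business text.
target_text = (
    "Every cast member greets guests. A cast member helps guests find rides. "
    "New hires become a cast member after training."
)
baseline_text = (
    "Every employee greets customers. An employee helps customers find products. "
    "New hires become an employee after training."
)

print(distinctive_terms(target_text, baseline_text))
```

With a large enough baseline, the phrases that float to the top are good candidates for the jargon an investigator or social engineer would want to know.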
Another innovative, FUD-free implementation of NLP would be assisting authorities and organizations like Trace Labs (who run Missing Persons CTF events; they have a Global Event on February 1, 2020) by using ML and NLP to read about the subject’s patterns online, then releasing the code to look in various places. Each time, the accuracy could get better with successful training.
In conclusion, there is a lot of research left to be done on ML and NLP. There are many possible applications for the discipline, but it will be challenging both to learn it and to cut through vendor hype and FUD. For me, I plan on doing more research, and I am considering pursuing a second master’s degree in Data Science. From there, who knows? I might try to complete a doctorate, or I may stay happily in my home office hacking the planet.