This is the third post in a series of posts about AMPLEC, my most recent project. See more details about the project and the previous posts here.
In this post I will cover the heart of the project: how the malware-analysis data was processed so the LLM can query it effectively. I called this process “Naturalization” because we take structured JSON data and turn it into something closely resembling natural language while still following a replicable structure.
The Problem #
While starting to do the first experiments for this project, I hit my first major roadblock.
Small Models, Big Headaches #
While assessing the feasibility of the project I mostly experimented with ChatGPT (GPT-4o), which frankly has no issues pulling information from large JSON blocks without hallucinating. When I then started to assess different open-source models, I was confronted with my own greenness: especially the smaller models like Llama3.2:3b or Llama3.1:8b hallucinated significantly even on easy data-extraction tasks from a 10-line JSON block.
With one of the hard requirements of the project being data security and the ability to run the whole system locally, I needed to make it work with the small open models I could conceivably run on a laptop without a dedicated graphics card.
Why LLMs Struggle with JSON #
Now I need to preface this section with an acknowledgement: there are many LLMs that have no issues with structured data, but those are usually either specifically optimized for it or simply larger than the models I was able to run.
Tokenization #
The reason lies in how LLMs fundamentally ingest data. This process, called tokenization, turns words into tokens (similar to the picture below).
Tokenization works with a predefined dictionary of tokens, or subwords, which can be viewed as a regular language and are mapped to numerical values the LLM can work with more effectively. To understand the issue more deeply, let’s briefly digress and shed some light on how tokens are defined. A widely used system for this is so-called Byte-Pair Encoding (BPE). It starts with an alphabet of single characters (as bytes) and then iteratively merges frequently occurring character pairs into subwords: for example combining i and n into in, then in the next iteration appending g to in to create ing. This process is repeated for the most commonly used combinations until a predefined vocabulary size is reached. Most modern LLMs also have many complete English words in their dictionary, which is also visible in the screenshot above. More complex words (as visible below) can in turn be made up of many tokens.
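To make the iterative merging concrete, here is a minimal sketch of a single BPE merge step in Python. The tiny corpus and the helper functions are invented for illustration; real tokenizers track merge ranks and operate over far larger corpora.

```python
from collections import Counter

# Toy sketch of one Byte-Pair-Encoding merge step (illustrative only):
# find the most frequent adjacent symbol pair and merge it into a subword.
def most_frequent_pair(words):
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # e.g. 'i' + 'n' -> 'in'
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Start from single characters, as BPE does.
corpus = [list(w) for w in ["in", "ing", "running", "winning"]]
pair = most_frequent_pair(corpus)   # ('i', 'n')
corpus = merge(corpus, pair)
print(corpus)  # "in" is now a single subword token wherever it occurs
```

Repeating this loop (on this toy corpus the next merge would combine in and g into ing) until the vocabulary reaches its target size is what yields the token dictionary.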
Pay attention to yourself! #
Now when looking at the tokenization of JSON input, the indentation and the ", { and [ characters create many tokens for a comparatively small amount of information.
Each token is then assigned a multidimensional vector, which in turn is the “language” the LLM actually understands.
Most LLMs fundamentally work on the basis of self-attention, which involves calculating the importance of every token for every other token. This helps identify context and the relationships between words and pieces of information when processing language.
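Self-attention can be sketched without any ML framework: each token vector attends to every token (including itself) via softmax-normalized dot products. The vectors below are made up for illustration; real models use learned query/key/value projections and many attention heads.

```python
import math

# Bare-bones self-attention over three 2-dimensional token vectors:
# every token's output is a weighted average of all token vectors, where
# the weights come from a softmax over scaled pairwise dot products.
def self_attention(tokens):
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # similarity of this token to every token (scaled dot product)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]   # softmax: weights sum to 1
        outputs.append([sum(w * v[i] for w, v in zip(weights, tokens))
                        for i in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens)
print(len(out), len(out[0]))  # one output vector per token, same dimension
```

Each output row is a convex combination of the inputs, which is exactly how context gets mixed into every token's representation.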
Context retention in JSON #
Knowing how tokens are defined, how the processing works, and why JSON has a lower information-per-token density, let’s take a look at the best case when ingesting structured data:
```json
{
  "sha256": "de34da69219e4da77015469778509fc15cb412a8f3c808124eed7a7725c519a0",
  "family": "LummaStealer"
}
```

When ingesting this code block, the context of the sha256 and the family tag are super close together. Therefore the LLM has no issues figuring that one out. Now let’s take a look at a harder example:
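You can get an intuition for the token overhead with a crude punctuation-aware split. This is not a real BPE tokenizer, and the short strings are invented for illustration, but it shows how the structural characters of JSON inflate the token count for the same two facts.

```python
import re

# Crude stand-in for a tokenizer: every word is one token and every piece
# of punctuation ({, }, ", :, ,) is its own token, roughly the way such
# characters tend to tokenize. Real BPE counts will differ.
def rough_tokens(text):
    return re.findall(r"\w+|[^\w\s]", text)

json_form = '{"sha256": "de34da69", "family": "LummaStealer"}'
natural_form = "The sample with SHA-256 de34da69 has family LummaStealer."

print(len(rough_tokens(json_form)), len(rough_tokens(natural_form)))
```

Under this crude split the JSON version needs roughly half again as many tokens as the sentence, even though both carry the same two facts.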
```json
{
  "sha256": "de34da69219e4da77015469778509fc15cb412a8f3c808124eed7a7725c519a0",
  "urls": [
    "fishy-business.com",
    "absolutely-not-malware.com"
  ],
  "ips": [
    "192.168.0.1",
    "127.0.0.1"
  ],
  "signatures": [
    {
      "name": "Commercial obfuscation software .NET Reactor by Eziriz",
      "score": 9,
      "indicators": []
    },
    {
      "label": "fw_startup_file",
      "name": "Drops startup file",
      "score": 7,
      "indicators": [
        {
          "ioc": "C:\\Users\\Admin\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Startup\\ilsucsfth.vbs",
          "description": "File created",
          "procid": 30
        }
      ]
    },
    {
      "name": "Suspicious use of WriteProcessMemory",
      "indicators": [
        {
          "description": "PID 2424 wrote to memory of 5060",
          "pid": 2424,
          "procid": 30,
          "procid_target": 31
        },
        {
          "description": "PID 2424 wrote to memory of 5060",
          "pid": 2424,
          "procid": 30,
          "procid_target": 31
        },
        {
          "description": "PID 2424 wrote to memory of 5060",
          "pid": 2424,
          "procid": 30,
          "procid_target": 31
        },
        {
          "description": "PID 2424 wrote to memory of 5060",
          "pid": 2424,
          "procid": 30,
          "procid_target": 31
        }
      ]
    }
  ],
  "family": "LummaStealer"
}
```

In this example the whole problem becomes more obvious: the connection between the sha256 hash and the family tag becomes much weaker, because a large number of tokens now sits between the two important data points that need to be viewed in relation to each other.
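The growing separation can be made tangible with the same kind of crude split: count how many tokens sit between the sha256 value and the family key. The miniature “large” document below is a stand-in I built for illustration, not the real report.

```python
import re

# Crude punctuation-aware split (not a real BPE tokenizer); real token
# counts, and therefore real distances, would be even higher.
def rough_tokens(text):
    return re.findall(r"\w+|[^\w\s]", text)

small = '{"sha256": "de34da69", "family": "LummaStealer"}'
# stand-in for the large example: same two fields, plus filler signatures
large = ('{"sha256": "de34da69", "signatures": ['
         + ", ".join('{"pid": 2424, "procid": 30}' for _ in range(4))
         + '], "family": "LummaStealer"}')

def distance(text):
    toks = rough_tokens(text)
    return toks.index("family") - toks.index("sha256")

print(distance(small), distance(large))
```

The more material sits between the two keys, the more the attention mechanism has to bridge, which is exactly where small models start to drop the connection.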
This shows that raw JSON is not an ideal format for data ingestion, and with the small, unspecialized LLMs I tried to make this work with, the described effect was especially pronounced.
The Idea #
The problem I had was twofold: firstly, I needed to be careful not to overwhelm the LLM, or rather its context window (which describes the amount of information an LLM can process at once); secondly, I needed something an LLM could understand better than structured JSON data.
Now you don’t need to be an Einstein-level genius to figure out that something closer to natural language would probably be easier for an LLM to ingest, given that LLMs are mostly trained on natural language. This is how the idea was born to use natural language, or at least something closely resembling it.
The Solution #
The concrete solution is a process I called “Naturalization”. Before getting deeper into how the process works, I’ll briefly describe the design goals I had.
- Reproducible: The same JSON input ingested twice needs to yield the same result every time.
- Safe Defaults: The implementation can be specifically optimized for certain keys or structures that are expected to be in the JSON, BUT every possible JSON needs to yield a somewhat useful output.
- Easy to Adapt/Improve: As this is an iterative process, it needs to be easily improvable and adaptable.
How the Algorithm Works #
1. **Recursive Descent**: The routine walks each dictionary (and sub-dictionary) in turn. At every level it looks for a primary identifier, for instance a SHA-256 hash, and starts forming a sentence around it.
2. **Headline Construction**:
   - If a known identifier such as `"sha256"` is found, the sentence begins: `The sample with SHA-256 <hash>`
   - For unfamiliar keys, the headline defaults to `has <key_name>`.
3. **Recursive Expansion**: Remaining keys are processed one by one, each time appending to, or nesting under, the current headline. In the simple example `{ "sha256": "de34da69219e4da77015469778509fc15cb412a8f3c808124eed7a7725c519a0", "family": "LummaStealer" }` the algorithm
   - detects the hash as the identifier, producing the opening clause;
   - then encounters `"family"` and applies the default rule `has family`;
   - finally combines both fragments with the value to yield: `The sample with SHA-256 de34da69219e4da77015469778509fc15cb412a8f3c808124eed7a7725c519a0 has family LummaStealer.`
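To make the three steps concrete, here is a hedged sketch of the idea in Python. This is not AMPLEC’s actual implementation: the `KNOWN_IDENTIFIERS` mapping and the handling of lists and nested dictionaries are simplified assumptions of mine.

```python
# Sketch of the Naturalization idea (simplified, not the real implementation).
KNOWN_IDENTIFIERS = {"sha256": "The sample with SHA-256 {}"}  # assumed mapping

def naturalize(data):
    headline, fragments = None, []
    for key, value in data.items():
        if key in KNOWN_IDENTIFIERS and headline is None:
            headline = KNOWN_IDENTIFIERS[key].format(value)  # headline construction
        elif isinstance(value, dict):
            fragments.append(f"has {key} ({naturalize(value)})")  # recursive descent
        elif isinstance(value, list):
            fragments.append(f"has {key} " + ", ".join(
                naturalize(v) if isinstance(v, dict) else str(v) for v in value))
        else:
            fragments.append(f"has {key} {value}")  # safe default for unknown keys
    if headline is None:
        headline = "The entry"  # safe default: every JSON yields some output
    return (headline + " " + " and ".join(fragments)).strip() + "."

sample = {
    "sha256": "de34da69219e4da77015469778509fc15cb412a8f3c808124eed7a7725c519a0",
    "family": "LummaStealer",
}
print(naturalize(sample))
```

Because the walk is a plain deterministic loop over the input, the same JSON always yields the same sentence, which is what the reproducibility goal demands, and the fallback branch covers the safe-defaults goal.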
Conclusion #
Now you know why I built the Naturalization process, why and in which cases the problems I encountered will apply to you, and, if they do, how you might tackle them.