feeds

by zer0x0ne — on


some of my favourite websites: The Hacker News, Trail of Bits, Dark Reading, Threatpost, Tripwire, Security Weekly, xkcd



xkcd

Retrieved title: xkcd.com, 3 item(s)
Spinthariscope

Other high scorers are melt-in-your-hand aluminum-destroying gallium and tritium-powered glowsticks. Lawn darts are toward the other end.

Language Development

The worst is the Terrible Twos, when they're always throwing things and shrieking, "forsooth, to bed thou shalt not take me, cur!"

Decorative Constants

Arguably, the '1/2' in the drag equation is purely decorative, since drag coefficients are already unitless and could just as easily be half as big. Some derivations give more justification for the extra 1/2 than others, but one textbook just calls it 'a traditional tribute to Euler and Bernoulli.'

Security Weekly

Retrieved title: Security Weekly, 3 item(s)
Ransomware Damage Claims Driving Insurance Hikes

The costs of cyber insurance policies are rising exponentially while underwriters are tightening the rules around who qualifies for cyber insurance, and at the same time, insurer capacity is constricting dramatically. The numbers are all over the place, but the latest statistics from the Council of Insurance Agents and Brokers reported a 25.5% increase in […]

The post Ransomware Damage Claims Driving Insurance Hikes appeared first on Security Weekly.

Decrypt As If Your Security Depends on It

Encryption has reached near-full adoption by internal teams hoping to implement stronger security and privacy practices. Simultaneously, attackers are using the same mechanisms to hide their malicious activity from the defender’s line of sight. According to the Ponemon Institute’s 2021 Global Encryption Trends Study, 50% of organizations have an encryption plan consistently applied across their […]

The post Decrypt As If Your Security Depends on It appeared first on Security Weekly.

DevSecOps Scanning Challenges & Tips

There are many ways to do DevSecOps, and each organization — each security team, even — uses a different approach. Questions such as how many environments you have and the frequency of deployment of those environments are important in understanding how to integrate a security scanner into your DevSecOps machinery. The ultimate goal is speed […]

The post DevSecOps Scanning Challenges & Tips appeared first on Security Weekly.

Tripwire

Retrieved title: The State of Security, 3 item(s)
Why Is It Important to Invest in OT Cybersecurity for 2022?

As we enter 2022, it’s important that organizations invest in cybersecurity for their operational technology (OT) systems. Why? One of the reasons is that Industry 4.0 can sometimes introduce more risk for OT. This is evident in several Industry 4.0 market trends. For example, there’s digital twin infrastructure. That’s where you make a digital copy […]… Read More

The post Why Is It Important to Invest in OT Cybersecurity for 2022? appeared first on The State of Security.

How Should Organizations Tackle Their Data Privacy Requirements?

Data is among the most valuable assets that need to be safeguarded at all costs. But in the digitally-driven business world, cybercrimes are prevalent, making data protection and data privacy a main focal point. The increasing use of technology and the growing exposure to evolving cyber threats have dramatically changed the data security and privacy […]… Read More

The post How Should Organizations Tackle Their Data Privacy Requirements? appeared first on The State of Security.

Malicious USB drives are being posted to businesses

A notorious cybercrime gang, involved in a series of high-profile ransomware attacks, has in recent months been sending out poisoned USB devices to US organisations. As The Record reports, the FBI has warned that FIN7 – the well-organised cybercrime group believed to be behind the Darkside and BlackMatter ransomware operations – has been mailing out […]… Read More

The post Malicious USB drives are being posted to businesses appeared first on The State of Security.

The Hacker News

Retrieved title: The Hacker News, 3 item(s)
Dark Web's Largest Marketplace for Stolen Credit Cards is Shutting Down

UniCC, the biggest dark web marketplace of stolen credit and debit cards, has announced that it's shuttering its operations after earning $358 million in purchases since 2013 using cryptocurrencies such as Bitcoin, Litecoin, Ether, and Dash. "Don't build any conspiracy theories about us leaving," the anonymous operators of UniCC said in a farewell posted on dark web carding forums, according to

High-Severity Vulnerability in 3 WordPress Plugins Affected 84,000 Websites

Researchers have disclosed a security shortcoming affecting three different WordPress plugins that impact over 84,000 websites and could be abused by a malicious actor to take over vulnerable sites. "This flaw made it possible for an attacker to update arbitrary site options on a vulnerable site, provided they could trick a site's administrator into performing an action, such as clicking on a

Ukrainian Government Officially Accuses Russia of Recent Cyberattacks

The government of Ukraine on Sunday formally accused Russia of masterminding the attacks that targeted websites of public institutions and government agencies this past week. "All the evidence points to the fact that Russia is behind the cyber attack," the Ministry of Digital Transformation said in a statement. "Moscow continues to wage a hybrid war and is actively building forces in the

Threatpost

Retrieved title: Threatpost, 3 item(s)
Top Illicit Carding Marketplace UniCC Abruptly Shuts Down  

UniCC controlled 30 percent of the stolen payment-card data market, leaving analysts eyeing what’s next.

Real Big Phish: Mobile Phishing & Managing User Fallibility

Phishing is more successful than ever. Daniel Spicer, CSO of Ivanti, discusses emerging trends in phishing, and using zero-trust security to patch the human vulnerabilities underpinning the spike.

Critical Cisco Contact Center Bug Threatens Customer-Service Havoc

Attackers could access and modify agent resources, telephone queues and other customer-service systems – and access personal information on companies’ customers.

Dark Reading

Retrieved title: Dark Reading, 3 item(s)
Russia Takes Down REvil Ransomware Operation, Arrests Key Members

Timing of the move has evoked at least some skepticism from security experts about the country's true motives.

The Cybersecurity Measures CTOs Are Actually Implementing

Companies look to multifactor authentication and identity and access management to block attacks, but hedge their bets with disaster recovery.

Maryland Dept. of Health Responds to Ransomware Attack

An attack discovered on Dec. 4, 2021, forced the Maryland Department of Health to take some of its systems offline.

Trail of Bits

Retrieved title: Trail of Bits Blog, 3 item(s)
Finding unhandled errors using CodeQL

By Fredrik Dahlgren

One of your developers finds a bug in your codebase—an unhandled error code—and wonders whether there could be more. He combs through the code and finds unhandled error after unhandled error. One lone developer playing whack-a-mole. It’s not enough. And your undisciplined team of first-year Stanford grads never learned software engineering. You’re doomed.

Good developers know that unhandled errors can be exploitable and cause serious problems in a codebase. Take CVE-2018-1002105, a critical vulnerability in Kubernetes allowing attackers to leverage incorrectly handled errors to establish a back-end connection through the Kubernetes API.

At Trail of Bits, we find issues like this all the time, and we know that there are better ways to find the rest than manually searching for them one by one. One particularly recent discovery of this problem motivated us to write this post. Rather than manually sifting through the codebase like our poor developer playing whack-a-mole, we used CodeQL to conduct a variant analysis (taking an existing vulnerability and searching for similar patterns). In this post, we’ll walk you through how we used CodeQL to whack all the moles at once.

Building a CodeQL database

To be able to run CodeQL queries against the codebase, we first needed to build a CodeQL database. Typically, this is done with the CodeQL CLI using the following command:

codeql database create -l <language> -c '<build command>' <database>

In our case, the codebase under audit was developed on Windows using Visual Studio. Since CodeQL does not integrate with Visual Studio directly, we used MSBuild.exe to build solutions from the Windows command line using the following command:

codeql database create -l cpp -c 'MSBuild.exe <solution>.sln' .codeql

Setting up a custom query pack

To be able to run queries against the database, we defined a custom query pack, or a QL pack, which contains query metadata. Query packs can also be used to define custom query suites and query test suites. (If this makes your heart beat faster, see here and here.) In the same directory housing our custom queries, we created a file named qlpack.yml with the following content:

name: <query pack name>
version: 0.0.1
libraryPathDependencies: [codeql-cpp]

This file defines the custom query pack and its dependencies. The last line of the qlpack.yml file simply indicates that the query pack depends on the built-in CodeQL libraries for C and C++, which we need to get off the ground.

Finding unhandled errors

For this post, let’s say that the codebase used a custom error type called CustomErrorType to propagate errors. To locate all calls to functions returning CustomErrorType, we started by creating a new CodeQL type called CustomError:

import cpp
// Needed by the data flow and taint tracking predicates used below.
import semmle.code.cpp.dataflow.DataFlow
import semmle.code.cpp.dataflow.TaintTracking

class CustomError extends FunctionCall {
    CustomError() {
        this.getUnderlyingType().getName() = "CustomErrorType"
    }
}

Since return values are represented as function calls in CodeQL, it makes sense to extend the FunctionCall type (which is, in fact, a subtype of the more general Expr type used to model arbitrary expressions). Using this.getUnderlyingType() ensures that the name of the underlying type is CustomErrorType. This means that we capture all function calls in which the return type is either CustomErrorType or any typedef that resolves to CustomErrorType.
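
For contrast, here is a sketch of a stricter class (our own illustration, not from the audit) that matches on getType() alone; under this assumption, it would miss call sites whose declared return type is a typedef that resolves to CustomErrorType:

// Illustration only: matching the declared type name directly.
// Unlike CustomError above, this misses returns declared via a
// typedef such as `typedef CustomErrorType Status;`.
class CustomErrorExactType extends FunctionCall {
    CustomErrorExactType() {
        this.getType().getName() = "CustomErrorType"
    }
}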

To test that the CustomError class does what we expect, we simply ran a query over the codebase and selected all CustomErrorType return values. To do so, we added the following select clause immediately below the class definition:

from
    CustomError ce
select
    ce.getLocation(),
    "Unhandled error code in ", ce.getEnclosingFunction().getName(),
    "Error code returned by ", ce.getTarget().getName()

Here, ce.getEnclosingFunction() returns the function containing the CustomErrorType instance (i.e., the calling function), and ce.getTarget() returns the target function of the underlying FunctionCall (i.e., the called function).

We saved the file under a descriptive and colorful name—for this post, let’s call it UnhandledCustomError.ql. To run the query, we wrote the following:

codeql query run -d <database> UnhandledCustomError.ql

This query returns all call sites of functions in the codebase that return a value of type CustomErrorType, along with the names of the calling function and the called function.
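
If you want to keep the raw results around for later triage, the CodeQL CLI can also write them to a BQRS file and decode that into CSV. A minimal sketch using the standard codeql query run and codeql bqrs decode flags (the output file names are our own):

codeql query run -d <database> -o results.bqrs UnhandledCustomError.ql
codeql bqrs decode --format=csv -o results.csv results.bqrs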

Developing new queries iteratively in this way—by first over-approximating the vulnerability class you’re trying to model and then successively refining the query to prune false positives—makes it easier to catch mistakes as they happen since actually running a query on a codebase is a bit of a black box.

So what is an unhandled error?

To be able to restrict the results to unhandled errors, we need to define what it means to handle an error using CodeQL. Intuitively, handling an error means that the return value is acted upon and affects control flow in some way. This idea can be captured using CodeQL by checking whether the return value taints the condition of a branching statement, like an if statement, a while statement, or a switch statement. As CodeQL supports both local and global taint tracking, we had a choice of how to model this.

In our case, we were initially a bit concerned about how CodeQL’s global taint tracking engine would handle itself on a larger codebase, so we decided to try modeling the problem using local taint tracking. As a first approximation, we considered cases in which the returned error code directly affected the control flow of the calling function in some way. To capture these cases, we added the following predicate to the CustomError CodeQL type (thus, this below refers to an instance of CustomError):

// True if the return value is checked locally.
predicate isChecked() {
    // The return value flows into the condition of an if-statement.    	    
    exists (IfStmt is |
        TaintTracking::localTaint(
            DataFlow::exprNode(this),
            DataFlow::exprNode(is.getCondition().getAChild*())
        )
    ) or
    // The return value flows into the condition of a while-statement.
    exists (WhileStmt ws |
        TaintTracking::localTaint(
            DataFlow::exprNode(this),
            DataFlow::exprNode(ws.getCondition().getAChild*())
        )
    ) or
    // The return value flows into the condition of a switch-statement.
    exists (SwitchStmt ss |
        TaintTracking::localTaint(
            DataFlow::exprNode(this), 
            DataFlow::exprNode(ss.getExpr().getAChild*())
        )
    )
}

Since TaintTracking::localTaint only models local data flow, we did not need to require that the conditional statement (the sink) is located in the same function as the returned error (the source). With local taint tracking, we got this for free.

Our intuition told us that we wanted to model taint flowing into the condition of a branching statement, but looking at how this is modeled, we actually required taint to flow into a sub-expression of the condition. For example, is.getCondition().getAChild*() returns a sub-expression of the if statement condition. (The * indicates that the operation is applied 0 or more times. You can use + for 1 or more times.)

It may not be immediately obvious why we needed to use getAChild() here. If this taints a sub-expression of the if statement condition C, then it is natural to assume that this would also taint the entire condition. However, looking at the CodeQL taint tracking documentation, it is clear that taint propagates only from a source to a sink if a “substantial part of the information from the source is preserved at the sink.” In particular, boolean expressions (which carry only a single bit of information) are not automatically considered tainted by their individual sub-expressions. Thus, we needed to use getAChild() to capture that this taints a sub-expression of the condition.
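
To make the difference concrete, here is a sketch (our own illustration) of the stricter clause without getAChild*(). Because the boolean condition itself is not treated as tainted by its sub-expressions, this version would miss a tainted error value inside a comparison such as err != NO_ERROR:

// Illustration only: requiring taint to reach the entire condition.
// This misses conditions like `if (err != NO_ERROR)`, since the boolean
// comparison is not considered tainted by its sub-expressions.
exists (IfStmt is |
    TaintTracking::localTaint(
        DataFlow::exprNode(this),
        DataFlow::exprNode(is.getCondition())
    )
)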

It is worth mentioning that it would have been possible to use the DataFlow module to model local data flow: both DataFlow::localFlow and TaintTracking::localTaint can be used to capture local data flow. However, since DataFlow::localFlow is used to track only value-preserving operations, it made more sense in our case to use the more general TaintTracking::localTaint predicate. This allowed us to catch expressions like the following, in which the returned error is mutated before it is checked:

    if ( ((CustomErrorType)(response.GetStatus(msg) & 0xFF)) == NO_ERROR )
    {
        […]
    }

To restrict the output from the CodeQL select statement, we added a where clause to the query:

from
    CustomError ce
where
    not ce.isChecked()
select
    ce.getLocation(),
    "Unhandled error code in ", ce.getEnclosingFunction().getName(),
    "Error code returned by ", ce.getTarget().getName()

Refining the query

Running the query again, we noticed that it found numerous locations in the codebase in which a returned error was not handled correctly. However, it also found a lot of false positives in which the return value did affect control flow globally in some way. Reviewing some of the results manually, we noticed three overarching classes of false positives:

  1. The returned error was simply returned from the enclosing function and passed down the call chain.
  2. The returned error was passed as an argument to a function (which hopefully acted upon the error in some meaningful way).
  3. The returned error was assigned to a class member variable (which could then be checked elsewhere in the codebase).

The local behavior in all three of these cases could clearly be modeled using local taint tracking.

First, to exclude all cases in which the returned error was used to update the return value of the calling function, we added the following predicate to the CustomError class:

// The return value is returned from the enclosing function.
predicate isReturnValue() {
    exists (ReturnStmt rs |
        TaintTracking::localTaint(
            DataFlow::exprNode(this),
            DataFlow::exprNode(rs.getExpr())
        )
    )
}

Second, to filter out cases in which the return value was passed as an argument to some other function, we added the following predicate:

// The return value is passed as an argument to another function.
predicate isPassedToFunction() {
    exists (FunctionCall fc |
        TaintTracking::localTaint(
            DataFlow::exprNode(this),
            DataFlow::exprNode(fc.getAnArgument())
        )
    )
}

Again, since TaintTracking::localTaint only models local data flow, we did not need to require that the enclosing function of the FunctionCall node fc is identical to the enclosing function of this.

Finally, to model the case in which the returned error was used to update the value of a class member variable, we needed to express the fact that the calling function was a class method and that the return value was used to update the value of a member variable on the same class. We modeled this case by casting the enclosing function to a MemberFunction and then requiring that there is a member variable on the same object that is tainted by this.

// Test if the return value is assigned to a member variable.
predicate isAssignedToMemberVar() {
    exists (MemberVariable mv, MemberFunction mf |
        mf = this.getEnclosingFunction() and
        mf.canAccessMember(mv, mf.getDeclaringType()) and
        TaintTracking::localTaint(
            DataFlow::exprNode(this), 
            DataFlow::exprNode(mv.getAnAccess())
        )
    )
}

Note that it is not enough to require that data flows from this to an access of a member variable. If you do not restrict mv further, mv could be a member of any class defined in the codebase. Clearly, we also needed to require that the calling function is a method on some class and that the member variable is a member of the same class. We captured this requirement using the predicate canAccessMember, which is true when the enclosing method mf can access the member variable mv in the context of the mf.getDeclaringType() class.

Running the updated version of the query, we then noticed that some of the results were from unit tests. Since issues in unit tests are typically of less interest, we wanted to exclude them from the final result. This, of course, could easily be done using grep -v on the resulting output from codeql, but it could also be done by restricting the location of the call site using CodeQL itself.

To filter by file path, we defined a new class called IgnoredFile, which captured the type of files we wanted to exclude from the result. In this case, we excluded any file with an absolute path containing the word "test":

class IgnoredFile extends File {
    IgnoredFile() {
        this.getAbsolutePath().matches("%test%")
    }
}

We then added the following line to the where clause of the final query, which excluded all the locations that we were less interested in:

not ce.getFile() instanceof IgnoredFile
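
Putting it all together, the where clause of the final query combines the four predicates and the file filter. A sketch using the predicates and classes defined above (the actual query is in the file referenced in the next paragraph):

from
    CustomError ce
where
    // Keep only error codes that are never acted upon locally...
    not ce.isChecked() and
    not ce.isReturnValue() and
    not ce.isPassedToFunction() and
    not ce.isAssignedToMemberVar() and
    // ...and ignore matches in test files.
    not ce.getFile() instanceof IgnoredFile
select
    ce.getLocation(),
    "Unhandled error code in ", ce.getEnclosingFunction().getName(),
    "Error code returned by ", ce.getTarget().getName()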

The final query resulted in slightly over 100 code locations that we were able to review and verify manually. For reference, the final query is located here.

But what about global data flow?

CodeQL supports global data flow and taint tracking through the DataFlow::Configuration and TaintTracking::Configuration classes. As explained earlier, we can use the DataFlow module to track value-preserving operations and the TaintTracking module to track more general flows in which the value may be updated along the flow path. While we were initially concerned that the codebase under review was too large for CodeQL’s global taint tracking engine to handle, we were also curious to see whether a global analysis would give us more accurate results than a local analysis could. As it turns out, the query using global data flow was easier to express, and it achieved more accurate results with the same running time as the query using local data flow!

Since we didn’t want to restrict ourselves to value-preserving operations, we needed to extend the TaintTracking::Configuration class. To do so, we defined what it means to be a source and a sink by overriding the isSource and isSink predicates as follows:

class GuardConfiguration extends TaintTracking::Configuration {
    GuardConfiguration() { this = "GuardConfiguration" }
  
    override predicate isSource(DataFlow::Node source) {
        source.asExpr().(FunctionCall).getUnderlyingType().getName() =
            "CustomErrorType"
    }
  
    override predicate isSink(DataFlow::Node sink) {
        exists (IfStmt is | sink.asExpr() = is.getCondition().getAChild*()) or
        exists (WhileStmt ws | sink.asExpr() = ws.getCondition().getAChild*()) or
        exists (SwitchStmt ss | sink.asExpr() = ss.getExpr().getAChild*())
    }
}

We then redefined the predicate CustomError::isChecked in terms of global taint tracking as follows:

class CustomError extends FunctionCall {
    CustomError() {
        this.getUnderlyingType().getName() = "CustomErrorType"
    }
  
    predicate isCheckedAt(Expr guard) {
        exists (GuardConfiguration config |
            config.hasFlow(
                DataFlow::exprNode(this), 
                DataFlow::exprNode(guard)
            )
        )
    }
  
    predicate isChecked() {
        exists (Expr guard | this.isCheckedAt(guard))
    }
}

That is, the return error is handled if it taints the condition of an if, while, or switch statement anywhere in the codebase. This actually made the entire query much simpler.

Interestingly, the run time for the global analysis turned out to be about the same as the run time for local taint tracking (about 20 seconds to compile the query and 10 seconds to run it on a 2020 Intel i5 MacBook Pro).

Running the query using global taint tracking gave us over 200 results. By manually reviewing these results, we noticed that error codes often ended up being passed to a function that created a response to the user of one of the APIs defined by the codebase. Since this was expected behavior, we excluded all such cases from the end result. To do so, we simply added a single line to the definition of GuardConfiguration::isSink as follows:

override predicate isSink(DataFlow::Node sink) {
    exists (ReturnStmt rs | sink.asExpr() = rs.getExpr()) or
    exists (IfStmt is | sink.asExpr() = is.getCondition().getAChild*()) or
    exists (WhileStmt ws | sink.asExpr() = ws.getCondition().getAChild*()) or
    exists (SwitchStmt ss | sink.asExpr() = ss.getExpr().getAChild*()) or
    exists (IgnoredFunctionCall fc | sink.asExpr() = fc.getExtParam())
}

Here, IgnoredFunctionCall is a custom type capturing a call to the function generating responses to the user. Running the query, we ended up with around 150 locations that we could go through manually. In the end, the majority of the locations identified using CodeQL represented real issues that needed to be addressed by the client. The updated file UnhandledCustomError.ql can be found here.

At Trail of Bits, we often say that we never want to see the same bug twice in a client’s codebase, and to make sure that doesn’t happen, we often deliver security tooling like fuzzing harnesses and static analysis tools with our audit reports. In this respect, tools like CodeQL are great, as they let us encode our knowledge about a bug class as a query that anyone can run and benefit from—in effect, ensuring that we never see that particular bug ever again.

Toward a Best-of-Both-Worlds Binary Disassembler

By Stefan Nagy

This past winter, I was fortunate to have the opportunity to work for Trail of Bits as a graduate student intern under the supervision of Peter Goodman and Artem Dinaburg. During my internship, I developed Dr. Disassembler, a Datalog-driven framework for transparent and mutable binary disassembly. Though this project is ongoing, this blog post introduces the high-level vision behind Dr. Disassembler’s design and discusses the key implementation decisions central to our current prototype.

Introduction

Binary disassembly is surprisingly difficult. Many disassembly tasks (e.g., code/data disambiguation and function boundary detection) are undecidable and require meticulous heuristics and algorithms to cover the wide range of real-world binary semantics. An ideal disassembler has two key properties: (1) transparency, meaning that its underlying logic is accessible and interpretable, and (2) mutability, meaning that it permits ad hoc interaction and refinement. Unfortunately, despite the abundance of disassembly tools available today, none have both transparency and mutability. Most off-the-shelf disassemblers (e.g., objdump, Dyninst, McSema, and Angr) perform “run-and-done” disassembly, and while their underlying heuristics and algorithms are indeed open source, even the slightest of changes (e.g., toggling on a heuristic) requires a complete rebuild of the tool and regeneration of the disassembly. In contrast, popular commercial disassemblers like IDA Pro and Binary Ninja provide rich interfaces for user-written plugins, yet these tools are almost entirely proprietary, making it impossible to fully vet where their core heuristics and algorithms fall short. Thus, reverse engineers are left to choose between two classes of disassemblers: those full of ambiguity or those with zero flexibility.

In this blog post, I introduce our vision for a best-of-both-worlds (transparent and mutable) platform for binary disassembly. Our approach was inspired by recent disassembly tools like ddisasm and d3re, which use the Soufflé Datalog engine. Dr. Disassembler uses Trail of Bits’ in-house incremental and differential Datalog engine, Dr. Lojekyll, to specify the disassembly process. Below, I describe how Dr. Disassembler’s relational view of disassembly is a step toward transparent, mutable disassembly—streamlining the integration of new heuristics, algorithms, and retroactive updates—without the need to perform de novo disassembly per every incremental update.

Background: Disassembly, Datalog, and Dr. Lojekyll

Disassembly is the process of translating a binary executable from machine code into a human-interpretable, assembly language representation of the program. In software security, disassembly forms the backbone of many critical tasks such as binary analysis, static rewriting, and reverse engineering. At Trail of Bits, disassembly is the crucial first step in our executable-to-LLVM lifting efforts, such as Remill and McSema.

At a high level, a disassembler begins by first parsing a binary’s logical sections to pinpoint those that contain executable code. From there, instruction decoding translates machine code into higher-level instruction semantics. This procedure uses one of two strategies: linear sweep or recursive descent.

Linear sweep disassemblers (e.g., objdump) perform instruction decoding on every possible byte, beginning at the very first byte index. However, on variable-length instruction set architectures like x86, a linear sweep disassembler that naively treats all bytes as instructions could perform instruction decoding on non-instruction bytes (e.g., inlined jump tables). To overcome this issue, many modern disassemblers improve their analyses by recovering metadata (e.g., debugging information) or applying data-driven heuristics (e.g., function entry patterns).

On the other hand, recursive descent disassemblers (e.g., IDA Pro) follow the observed control flow to selectively re-initiate linear sweep only on recovered branch target addresses. While recovering the target addresses of jump tables is generally sound, recovering the targets for indirect calls is a far more challenging problem, in which common-case soundness has yet to emerge.

Datalog is one of the more popular members in a class of programming languages known as logical programming. Compared to imperative programming languages (e.g., Python, Java, C, and C++), which are structured around a program’s control flow and state, logical programming (e.g., Prolog and Datalog) is structured solely around logical statements. In our use case of binary disassembly, a logical statement can be useful for capturing the addresses in a binary that correspond to plausible function entry points: (1) targets of direct call instructions, (2) common function prologues, or (3) any function address contained in the symbol table. This use case is shown below in Dr. Lojekyll syntax:

Listing 1: This query retrieves the set of all plausible function entry points. Here, “free” denotes that the query must find all candidates that match the subsequent clauses. In a bounded clause (e.g., given some fixed address), the tag “bound” is used instead (see listing 5).

From a logical programming perspective, the code snippet above is interpreted as follows: there is a plausible function at address FuncEA if a direct call to FuncEA, a known function entry instruction sequence starting at FuncEA, or a function symbol at FuncEA exists.

At a higher level, logical and functional programming are part of a broader paradigm known as declarative programming. Unlike imperative languages (e.g., Python, Java, C, and C++), declarative languages dictate only what the output result should look like. For instance, in the previous example of retrieving function entry points, our main focus is the end result—the set of function entry points—and not the step-by-step computation needed to get there. While there is certainly more to logical and declarative programming than the condensed explanation offered here, the key advantage of logical programming is its succinct representation of data as statements.

Here’s where Datalog shines. Suppose that after populating our database of “facts”—sections, functions, and instructions—we want to make some adjustments. For example, imagine we’re analyzing a position-independent “hello world” binary with the following disassembly obtained for one of its functions:

Listing 2: An example of a relocated call target

We also know that the following relocation entries exist:

Listing 3: Relocation entry information for the example in listing 2

At runtime, the dynamic linker will update the operand of the call at 0x526 to point to printf@PLT. When the call is taken, printf@PLT then transfers to printf’s Global Offset Table (GOT) entry, and the execution proceeds to the external printf.

If you’re familiar with IDA Pro or Binary Ninja, you’ll recognize that both tools adjust the relocated calls to point to the external symbols themselves. In the context of binary analysis, this is useful because it “fixes up” the otherwise opaque calls whose targets are revealed only through dynamic linking. In Datalog, we can simply accommodate this with a few lines:

Listing 4: This exported message rewrites the call to skip its intermediary Procedure Linkage Table (PLT) entry. Here, “#export” denotes that the message will alter some fact(s) in the Datalog database.

Voila! Our representation of the indirect call no longer requires the intermediary redirection through the PLT. As a bonus, we can maintain a relationship table to map every call of this type to its targets. With this example in mind, we envision many possibilities in which complex binary semantics are modelable through relationship tables (e.g., points-to analysis, branch target analysis, etc.) to make binary analysis more streamlined and human-interpretable.

Dr. Lojekyll is Trail of Bits’ new Datalog compiler and execution engine and the foundation on which Dr. Disassembler is built. It adopts a publish/subscribe model, in which Dr. Lojekyll-compiled programs “subscribe” to messages (e.g., there exists an instruction at address X). When messages are received, the program may then introduce new messages (e.g., there exists a fall-through branch between instructions A and B) or remove previous ones. Compiled programs may also publish messages to external parties (e.g., an independent server), which may then “query” data relationships from the Datalog side.

Dr. Lojekyll’s publish/subscribe model is well suited for tasks in which “undo”-like features are required. In binary disassembly, this opens up many possibilities in human-in-the-loop binary analysis and alterations (think Compiler Explorer but for binaries). At the time of writing, Dr. Lojekyll supports the compilation of Datalog into Python programs and has emerging support for C++.

Introducing Dr. Disassembler

Conventional “run-and-done” disassemblers perform their analyses on the fly, confining them to whatever results—even erroneous ones—are obtained from the outset. Instead, Datalog enables us to move all analysis to post-disassembly, thus streamlining the integration of plug-and-play refinements and retroactive updates. And with its painless syntax, Datalog easily represents one of the most powerful and expressive platforms for user-written disassembly plugins and extensions. We implement our vision of transparent and mutable disassembly as a prototype tool, Dr. Disassembler. While Dr. Disassembler can theoretically use any Datalog engine (e.g., DDLog), we currently use Trail of Bits’ own Dr. Lojekyll. The implementation of Dr. Disassembler discussed in this blog post uses Dr. Lojekyll’s Python API. However, at the time of writing, we have since begun developing a C++-based implementation due to Python’s many performance limitations. Here, I introduce the high-level design behind our initial (and forthcoming) implementations of Dr. Disassembler.

Figure 1: Dr. Disassembler’s high-level architecture

Disassembly Procedure

Dr. Disassembler’s disassembly workflow consists of three components: (1) parsing, (2) decoding, and (3) post-processing. In parsing, we scan the binary’s sections to pinpoint those that contain instructions, along with any recoverable metadata (e.g., entry points, symbols, and imported/exported/local functions). For every identified code section, we begin decoding its bytes as instructions. Our instruction decoding process maps each instruction to two key fields: its type (e.g., call, jump, return, and everything else) and its outgoing edges.

Recovering Control Flow

An advantage of using Datalog is the ability to express complex program semantics as a series of simple, recursive relationships. Yet, when handling control flow, a purely recursive approach often breaks certain analyses, like function boundary detection: recursive analysis follows the control flow to each instruction’s targets and resumes the analysis from there. But unlike calls, jumps are not “returning” instructions, so for inter-procedural jumps the function will not be re-entered, causing the disassembler to miss the remaining instructions in the function containing the jump.

To unify recursive and linear descent disassembly approaches, we developed the concept of non-control-flow successor instructions: for any unconditionally transferring jump or return instruction, we record an artificial fall-through edge from the instruction to the next sequential instruction. Though this edge has no bearing on the actual program, it effectively encodes the logical “next” instruction, thus unifying our linear and recursive analyses. These non-control-flow successor edges are the linchpin of our recursive analyses, like instruction-grouping and function boundary detection.

Post-Processing

At each step of parsing and decoding, we publish any interesting objects that we’ve found to our Dr. Lojekyll database. These core objects—symbols, sections, functions, instructions, and transfers—form the building blocks of our heuristic and recursive analyses. Our fundamental approach behind Dr. Disassembler is to “engulf” as much disassembly information as possible, regardless of correctness, and to refine everything afterward on the Datalog side. Because we consider every piece of information to be plausibly correct, we can retroactively update our disassembly when any new information is observed; and unlike conventional run-and-done tools, this does not require a de novo re-disassembly.

Example Exports and Queries

Dr. Disassembler streamlines binary analysis by focusing on disassembly artifacts themselves rather than the myriad steps needed to obtain them. To showcase some of Dr. Disassembler’s many capabilities, this section highlights several implementation examples of rigorous binary analysis tasks facilitated by two of Dr. Disassembler’s fundamental constructs: “exports” (messages that change/remove facts) and “queries” (which retrieve information about facts).

Query: Grouping Instructions into Functions

Given an arbitrary function address FuncEA, this query returns all the addresses of the instructions contained in that function. Two messages form this query: (1) function(u64 StartEA) and (2) instruction(u64 InsnEA, type Type, bytes Bytes).

Listing 5: An example Dr. Disassembler query that returns the addresses of the instructions contained in a function

Export: Instructions Dominating Invalid Instructions

This export returns all the instructions whose control flow leads to invalid instructions (i.e., where instruction decoding fails). This heuristic is critical for Dr. Disassembler to filter out the many “junk” instruction sequences that inevitably occur when decoding every possible byte sequence.

As in the previous example, we structure this relationship around two core messages: (1) instruction and (2) raw_transfer(u64 StartEA, u64 DestEA), the latter of which contains the unaltered control flow recovered from the binary (i.e., no alterations like the one in listing 4 are made yet).

Listing 6: An example Dr. Disassembler export that updates the database with all the instructions whose control flow leads to invalid instructions

Export: Inter-Function Padding

This export returns all the instruction addresses that serve as “padding” between functions (e.g., NOPs that do not belong to any function). Here, we use the following messages: (1) function, (2) section, (3) raw_transfer, and (4) basic_block(u64 BlockEA, u64 InsnEA). Identifying inter-function padding is a crucial step to refining our function-instruction grouping.

Listing 7: An example Dr. Disassembler export that updates the database with all the instruction addresses that serve as “padding” between functions

Future Work and Extensions

Our immediate plan is to extend Dr. Disassembler to a fully C++ implementation. Along with improving the performance of the tool, we expect that this transition will open many new doors for research on binary analysis:

  • Streamlined binary analysis platforms: Contemporary binary analysis platforms have rich interfaces for developing custom analysis plugins, but the sheer complexity of their APIs frequently leaves users bottlenecked by steep learning curves. As a next step, we want to develop Dr. Disassembler into a full-fledged binary analysis platform, complete with all the features needed to facilitate the easy creation and customization of user plugins.
  • GUI interfaces for binary analysis and transformation: By using Dr. Disassembler’s mutable representation of disassembly, we can develop new interfaces that enable real-time analysis and editing of binary executables (e.g., to help developers visualize how toggling on different heuristics affects analysis results). Our end goal here is something akin to Compiler Explorer… for binaries!
  • Exposing analysis blind spots: Our prototype of Dr. Disassembler is designed to use the outputs of multiple binary parsers and instruction decoders. Going forward, we would like to use Dr. Disassembler as a platform for developing automated techniques to identify where these competing tools agree and disagree with one another (e.g., on code-data disambiguation, branch target analysis, etc.) and pinpoint their weaknesses.

If any of these ideas interest you, feel free to get in touch with either me (Stefan Nagy) or Peter Goodman at Trail of Bits.

We’ll release our prototype Python implementation of Dr. Disassembler and provide a PDF version of this post at https://github.com/lifting-bits/dds. Happy disassembling!

Celebrating our 2021 Open Source Contributions

At Trail of Bits, we pride ourselves on making our best tools open source, such as algo, manticore, and graphtage. But while this post is about open source, it’s not about our tools.

In 2021, Trail of Bits employees submitted over 190 pull requests (PRs) that were merged into non-Trail of Bits repositories. This demonstrates our commitment to securing the software ecosystem as a whole and to improving software quality for everyone. A representative list of contributions appears at the end of this post, but here are some highlights:

We would like to acknowledge that submitting a PR is only a tiny part of the open source experience. Someone has to review the PR. Someone has to maintain the code after the PR is merged. And submitters of earlier PRs have to write tests to ensure the functionality of their code is preserved.

We contribute to these projects in part because we love the craft, but also because we find these projects useful. For this, we offer the open source community our most sincere thanks, and wish everyone a happy, safe, and productive 2022!

Osquery description updated on January 3, 2022.

Some of Trail of Bits’ 2021 Open Source Contributions