DEV Community

djmitche
djmitche

Posted on

Chromium Spelunking: A Stuck Task

The last installment was a bit of a detour from our task -- fetching a URL. But, it was a deliberate one, to learn about things I didn't understand well enough. Before that, I had gotten churl up to the point where it crashed because its delegate was NULL. So, let's give it a delegate!

Delegates

Delegates are a common pattern in Chromium, forming a way for lower-level code to interact with higher-level code. It's a kind of dependency inversion. For example, an embedder like Chrome or Fuchsia provides a delegate (several, in fact) to the content layer, which that content layer uses to call back for embedder-related features like autofill.

In this case, we're providing a delegate to the URLRequest, a subclass of net::URLRequest::Delegate. Happily, most of the methods have default implementations, and in fact the only required method is OnReadCompleted. My delegate overrides that method as well as OnConnected and OnResponseReceived, just to log when those events occur. In fact, I copied from TestDelegate, including the buf_ stuff and the Read calls.

So, at this point I have a delegate:

class ChurlDelegate : public net::URLRequest::Delegate {
 public:
  ChurlDelegate()
    : buf_(base::MakeRefCounted<net::IOBuffer>(kBufferSize)) {}
  ~ChurlDelegate() override {}
  int OnConnected(net::URLRequest* request,
                  const net::TransportInfo& info,
                  net::CompletionOnceCallback callback) override {
    DLOG(ERROR) << "OnConnected";
    return net::URLRequest::Delegate::OnConnected(request, info, std::move(callback));
  }

  void OnResponseStarted(net::URLRequest* request, int net_error) override {
    DLOG(ERROR) << "OnResponseStarted";
    if (net_error != net::OK) {
      DLOG(ERROR) << "Error: " << net_error;
      request->Cancel();
      return;                                                                                                                                                                                                      
    }
    // TODO: print response and headers?

    int bytes_read = request->Read(buf_.get(), kBufferSize);
    if (bytes_read >= 0) {
      DLOG(ERROR) << "Read completed immediately";
      OnReadCompleted(request, bytes_read);
    } else if (bytes_read != net::ERR_IO_PENDING) {
      DLOG(ERROR) << "Error from READ: " << bytes_read;
      request->Cancel();
      return;
    }
  }

  void OnReadCompleted(net::URLRequest* request, int bytes_read) override {
    DLOG(ERROR) << "OnReadCompleted with " << bytes_read << " bytes_read";
    do {
      DLOG(ERROR) << "GOT: " << buf_->data();                                                                                                                                                                      
      bytes_read = request->Read(buf_.get(), kBufferSize);
    } while (bytes_read > 0);
  }
 private:
  scoped_refptr<net::IOBuffer> buf_;
};
Enter fullscreen mode Exit fullscreen mode

and the Churl class:

class Churl {
 public:
  void Fetch() {
    net::URLRequestContextBuilder url_request_context_builder;
    url_request_context_builder.set_user_agent("Dustin's Experiment");
    url_request_context_builder.set_proxy_config_service(
      std::make_unique<ProxyConfigServiceDirect>());

    auto url_request_context = url_request_context_builder.Build();

    auto url_request = url_request_context->CreateRequest(
        GURL("http://httpbin.org/get"), // TODO: get this from command line
        net::RequestPriority::HIGHEST,
        &delegate_,                                                                                                                                                                                                
        net::NetworkTrafficAnnotationTag(net::MutableNetworkTrafficAnnotationTag(TRAFFIC_ANNOTATION_FOR_TESTS)),
        false);
    DLOG(ERROR) << "calling start";
    url_request->Start();
    DLOG(ERROR) << "started";
  }

 private:
  std::unique_ptr<net::URLRequestContext> url_request_context_;
  ChurlDelegate delegate_;
};
Enter fullscreen mode Exit fullscreen mode

and all of this created in main with

  Churl churl;
  base::ThreadPool::CreateSingleThreadTaskRunner({})->PostTask(
      FROM_HERE, base::BindOnce(&Churl::Fetch, base::Unretained(&churl)));
Enter fullscreen mode Exit fullscreen mode

Note that the delegate is passed to CreateRequest as a raw pointer. It's up to me to make sure that the delegate continues to exist for the duration of that request, and it will crash otherwise. Ask me how I know!

I put it into an instance variable in Churl, since that Churl instance lives until the end of main. The receiver for the Fetch method needs to be included in the task callback, and that's done with base::Unretained to indicate that the callback does not "own" the pointer.

An Easy Fix

So, the first run fails with this:

[0525/191840.310712:FATAL:command_line.cc(247)] Check failed: current_process_commandline_. 
Enter fullscreen mode Exit fullscreen mode

A little exploring of other "utility" tools like root_store_tool suggests

  base::CommandLine::Init(argc, argv);                                                                                                                                                                             
Enter fullscreen mode Exit fullscreen mode

And indeed, this gets things running -- sort of.

[0530/190907.523753:ERROR:churl_bin.cc(190)] URLRequestContext created: 0x7f4874003780
[0530/190907.524119:ERROR:churl_bin.cc(198)] calling start    
[0530/190907.526291:ERROR:churl_bin.cc(200)] started           
Enter fullscreen mode Exit fullscreen mode

That's it -- no logging from OnConnected. Maybe not so easy?

Finding a Stuck Task

I am sort of old-fashioned, and I'd rather debug things by running them and watching the output, than using fancy tools like lldb. My usual tactic is to add lots of debug logging (in Chromium, DLOG(ERROR) << "got here 1") to narrow down where a problem occurs. In this case, Start() started things off, but got "stuck" somewhere.

Looking at URLRequest::Start, it creates a new job with a job factory, and I can verify that it does, indeed call StartJob(..) with this new job by adding a debug print after that line.

That job is a subclass of URLRequestJob, but which subclass? Codesearch gives a list of subclasses, and URLRequestHttpJob seems a likely suspect. Adding some debug prints there confirms this guess. URLRequestJob then calls job->Start().

URLRequestHttpJob does its work in a series of callback tasks. Each possible bit of work is a separate method, and any conditionals or loops are accomplished "manually" by conditionally calling or posting a task for a different method. It's almost like assembly language, tracing out all of the conditional jumps and control-flow cycles. In this case, adding debug prints to the beginning of each method quickly shows that StartTransactionInternal is the last method called.

HttpRequestJob::Start

The StartTransactionInternal method seems to set up a transaction and call its Start method with a callback bound to URLRequestHttpJob::OnStartCompleted. If the Start method finishes immediately, it calls OnStartCompleted directly. This answers my question from the first post: when a function takes a callback, but does not return ERR_IO_PENDING, it also does not invoke the callback and expects its caller to do so. This seems like it would be a source of bugs, if one of those two control-flow paths is rarely taken.

Anyway, OnStartCompleted is never called, and one more debug print shows that Start is returning ERR_IO_PENDING so Start should be calling it.

I briefly wondered if this was another problem with threads and tasks, where the process was exiting before the task had a chance to run. Maybe FlushForTesting sees no tasks running while some kind of IO is pending, and returns? I added a sleep in the main task (base::PlatformThread::Sleep(base::Seconds(5))) and OnStartCompleted still did not run, so it's nothing quite that simple.

So let's dive one layer deeper. URLRequestHttpJob is using a factory to make an instance of an interface, in this case an HttpTransactionFactory creating an HttpTransaction.There's no easy way to print the class name of an object, but codesearch can (usually) list the subclasses of an interface, so a bit of guess-and-check works. My first guess was an HttpNetworkTransaction, but a debug print in its Start method did not appear. Then I remembered that the HttpCache is a "read-through" cache, and a debug print confirms that the HttpTransaction is an HttpCache::Transaction.

HttpTransaction::Start

The HttpCache::Transaction class uses the DoLoop pattern. I used a quick vim macro to print the states as they occurred:

[0525/195452.887052:ERROR:http_cache_transaction.cc(861)] STATE_GET_BACKEND
[0525/195452.887096:ERROR:http_cache_transaction.cc(866)] STATE_GET_BACKEND_COMPLETE
[0525/195452.887189:ERROR:http_cache_transaction.cc(870)] STATE_INIT_ENTRY
[0525/195452.887230:ERROR:http_cache_transaction.cc(875)] STATE_OPEN_OR_CREATE_ENTRY
[0525/195452.887406:ERROR:http_cache_transaction.cc(880)] STATE_OPEN_OR_CREATE_ENTRY_COMPLETE
[0525/195452.887450:ERROR:http_cache_transaction.cc(902)] STATE_ADD_TO_ENTRY
Enter fullscreen mode Exit fullscreen mode

So, it's stuck after DoAddToEntry. More debug prints there show that it's returning the rv from cache_->AddTransactionToEntry. cache_ is an HttpCache, so off we go!

HttpCache::AddTransactionToEntry

  // Adds a transaction to an ActiveEntry. This method returns ERR_IO_PENDING
  // and the transaction will be notified about completion via its IO callback.
  // In a failure case, the callback will be invoked with ERR_CACHE_RACE.
  int AddTransactionToEntry(ActiveEntry* entry, Transaction* transaction);
Enter fullscreen mode Exit fullscreen mode

So, what's the transaction's "IO callback" in this case? It has an io_callback_ property, and elsewhere in http_cache.cc I see transaction->io_callback().Run(OK) so at a guess that's what it's calling. But, what is that set to at the time? The HttpCache::Transaction constructor sets this to a binding for the OnIOComplete method, and it seems it's only set to other things in order to support testing. And that just calls DoLoop. And that would log. So, what's going on in the cache that causes it not to call the IO callback?

Adding some more debug prints to HttpCache, it seems that ProcessQueuedTransactions is executing, and posting a task, but HttpCache::OnProcessQueuedTransactions never executes.

I got stuck here for a bit, and decided I'd put in enough work to ask for help. Pairing with a colleague, we pored over the code, looking for what I might have missed. Nothing jumped out as obviously wrong.

As an experiment, we tried posting another task that just contained a lambda expression logging a message. That task ran! So, this isn't an issue of the thread pool not running tasks.

  base::SingleThreadTaskRunner::GetCurrentDefault()->PostTask(
      FROM_HERE,
      base::BindOnce(&HttpCache::OnProcessQueuedTransactions, GetWeakPtr(),
                     base::UnsafeDanglingUntriaged(entry)));
Enter fullscreen mode Exit fullscreen mode

The GetWeakPtr() in the PostTask invocation is curious, though -- what happens if that pointer becomes invalid before the task runs? Replacing it with base::Unretained(this) "fixes" the issue: the callback gets called. So it seems like the task machinery detects that a weak pointer has been invalidated and drops the task depending on it -- a pretty smart way to avoid use-after-free errors! That means that the HttpCache is getting destroyed before this callback can begin.

HttpCache Lifetime

So, let's see if we can figure out what's creating that and how long it lasts. It looks like HttpCache is a subclass of HttpTransactionFactory, and the URLRequestContext references it as http_transaction_factory_. That's set by URLRequestContextBuilder.

Ah, the URLRequestContext is in a local variable so it is being destroyed while the URLRequest is still in progress. So that's no good. Rearranging things to keep the context as an instance variable of the Churl class fixes the issue.

Which is not to say it's working, but that's a job for the next post in the series:

[0605/175824.724186:ERROR:churl_bin.cc(190)] URLRequestContext created: 0x7fb224002df0
[0605/175824.725417:ERROR:churl_bin.cc(198)] calling start     
[0605/175824.730763:ERROR:churl_bin.cc(200)] started     
[0605/175824.734067:FATAL:current_thread.cc(197)] Check failed: sequence_manager. 
#0 0x7fb23107ca8c base::debug::CollectStackTrace()
#1 0x7fb2310332da base::debug::StackTrace::StackTrace()                                                
#2 0x7fb231033295 base::debug::StackTrace::StackTrace()       
#3 0x7fb230d575f9 logging::LogMessage::~LogMessage()           
#4 0x7fb230d02bac logging::(anonymous namespace)::DCheckLogMessage::~DCheckLogMessage()
#5 0x7fb230d02bd9 logging::(anonymous namespace)::DCheckLogMessage::~DCheckLogMessage()
#6 0x7fb230d028bd logging::CheckError::~CheckError() 
#7 0x7fb230ed6cc2 base::CurrentIOThread::Get()     
#8 0x7fb232194f2d net::SocketPosix::Connect()                                                            
#9 0x7fb232199906 net::TCPSocketPosix::Connect()                                                         
Enter fullscreen mode Exit fullscreen mode

[update: it turns out it's also necessary to keep the URLRequest in an instance variable]

Top comments (0)