Downloading PDF documents (and getting around the blob URL)

#ruby #automation #testing #selenium

Many apps out there continue to produce downloadable files at the click of a button. This is especially true in industries that are perhaps a little stuck in the past, and their users and customers still find comfort in downloading that data to Excel (yuck) or having a PDF document of the literal thing you're looking at (literally my dad, every time).

We're going to talk about PDF documents specifically here.

When triggering a download of a PDF doc, that file might typically live on a file server with an easily accessible, direct URL. In those cases, downloading the file is pretty straightforward:



  require 'open-uri' # URI.open for http(s) URLs comes from open-uri
  file_path = "./tmp/my_file.pdf"
  download = URI.open(pdf_url)
  IO.copy_stream(download, file_path)

From there you can do whatever you need with the PDF doc. In some cases, a PDF reader library might work for you. In our case, we send these types of docs over to AWS Textract.

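What if there isn't a pdf_url for you to work with? In some cases, a download is triggered immediately after clicking a button. When you inspect where the download originates from, it turns out to come from a blob URL (a URL that starts with blob: and points at data held in the browser's memory rather than at a file on a server).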
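
Because a blob URL only resolves inside the browser that created it, the open-uri approach above can't fetch it. A minimal sketch of telling the two cases apart, assuming a hypothetical helper named blob_url? (it is not part of Capybara or Selenium) and made-up example URLs:

```ruby
# Hypothetical helper: blob URLs always use the "blob:" scheme,
# e.g. "blob:https://example.com/<uuid>".
def blob_url?(url)
  url.to_s.start_with?("blob:")
end

# Example URLs are invented for illustration.
blob_url?("https://example.com/files/report.pdf")                          # => false
blob_url?("blob:https://example.com/0f8fad5b-d9cb-469f-a165-70867728950e") # => true
```

If this returns false, the direct download is enough; if it returns true, the file has to be read from within the browser.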

Bear with me through this rocky ride while we go through the next few steps.

In more recent versions of Chrome (>= 106), the remainder of this solution only works when the blob URL is opened in a new tab. For whatever reason, opening it in the same window causes the link to expire in some way (a fun breaking change to investigate!). So here we go:



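  new_window = open_new_window
  within_window new_window do
    visit 'chrome://downloads'
    sleep 3 # give the downloads page time to render
    lines = page.text.split("\n")
    file_name = lines[3] # file name shown for the download
    blob_url = lines[4]  # the blob URL listed under it
    visit blob_url
    ...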
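
The fixed indexes 3 and 4 depend on the exact layout of the downloads page, which may not be the same in every Chrome version. A slightly more defensive sketch, assuming (as in the snippet above) that the file name sits on the line directly before the blob URL, is to look for the first line that starts with blob:. The helper name latest_blob_download and the sample text are hypothetical:

```ruby
# Hypothetical helper: pull the file name and blob URL out of the text of
# chrome://downloads. Assumes the file name is on the line right before
# the URL, which matches the indexes used above.
def latest_blob_download(page_text)
  lines = page_text.split("\n").map(&:strip)
  url_index = lines.index { |line| line.start_with?("blob:") }
  return nil if url_index.nil? || url_index.zero?

  { file_name: lines[url_index - 1], blob_url: lines[url_index] }
end

# Made-up sample of what page.text might return.
sample = <<~TEXT
  Downloads
  Search downloads
  Today
  my_file.pdf
  blob:https://example.com/0f8fad5b-d9cb-469f-a165-70867728950e
  Show in folder
TEXT

download = latest_blob_download(sample)
download[:file_name] # => "my_file.pdf"
download[:blob_url]  # => "blob:https://example.com/0f8fad5b-d9cb-469f-a165-70867728950e"
```

Inside the within_window block this would be called as latest_blob_download(page.text), with a nil result meaning no blob download was found on the page.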

At this point, you'll have the PDF doc rendered in its own tab in Chrome. Now we'll write a method that uses some JavaScript (while using Ruby!) to fetch the file inside the browser and encode it as a base64 string.



def get_file_content_in_base64(uri)
  result = page.evaluate_async_script("
    var uri = arguments[0];
    var callback = arguments[1];
    var toBase64 = function(buffer){for(var r,n=new Uint8Array(buffer),t=n.length,a=new Uint8Array(4*Math.ceil(t/3)),i=new Uint8Array(64),o=0,c=0;64>c;++c)i[c]='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'.charCodeAt(c);for(c=0;t-t%3>c;c+=3,o+=4)r=n[c]<<16|n[c+1]<<8|n[c+2],a[o]=i[r>>18],a[o+1]=i[r>>12&63],a[o+2]=i[r>>6&63],a[o+3]=i[63&r];return t%3===1?(r=n[t-1],a[o]=i[r>>2],a[o+1]=i[r<<4&63],a[o+2]=61,a[o+3]=61):t%3===2&&(r=(n[t-2]<<8)+n[t-1],a[o]=i[r>>10],a[o+1]=i[r>>4&63],a[o+2]=i[r<<2&63],a[o+3]=61),new TextDecoder('ascii').decode(a)};
    var xhr = new XMLHttpRequest();
    xhr.responseType = 'arraybuffer';
    xhr.onload = function(){ callback(toBase64(xhr.response)) };
    xhr.onerror = function(){ callback(xhr.status) };
    xhr.open('GET', uri);
    xhr.send();
    ", uri)
  raise "Request failed with status #{result}" if result.is_a?(Integer)

  result
end

And then, continuing from our earlier block of code, you can pass the blob_url to that method (Base64 ships with Ruby, so add require 'base64' if it isn't already loaded):



    ...
    base64_str = get_file_content_in_base64(blob_url)
    decoded_content = Base64.decode64(base64_str)
    file_path = "./tmp/#{file_name}"
    File.open(file_path, "wb") do |f|
      f.write(decoded_content)
    end
  end
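
Since the file name is scraped from page text, it may be worth sanitizing it and checking that the decoded bytes really look like a PDF before writing them. A hedged sketch of that final step: save_base64_pdf is a hypothetical helper, and the check assumes the file begins with the usual %PDF- header.

```ruby
require 'base64'
require 'fileutils'

# Hypothetical helper: decode a base64 string and write it out as a PDF.
# Raises if the decoded bytes do not start with the "%PDF-" header.
def save_base64_pdf(base64_str, file_name, dir: "./tmp")
  content = Base64.decode64(base64_str)
  raise ArgumentError, "decoded content is not a PDF" unless content.start_with?("%PDF-")

  FileUtils.mkdir_p(dir)
  # File.basename drops any directory parts from the scraped name.
  file_path = File.join(dir, File.basename(file_name))
  File.binwrite(file_path, content)
  file_path
end
```

With this in place, the last few lines of the block could become a single call such as save_base64_pdf(base64_str, file_name), which returns the path it wrote to.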

And there you have it! Your PDF doc is encoded to base64 in the browser, decoded back in Ruby, and written to a local file. The best part about this solution is that you won't need to worry about accessing downloaded files on any remote WebDriver servers, if that's what you're using (which you should be!).
