Hannah Frank

Advice on using a header in text to re-organize a large dataset

Hi Dev Community,

I have over 100 files in the following structure (called a FASTA -- very familiar to those of you who spend time looking at genetic data):

Original File 1:

>Gene1_id1
GATCGATCCGA
ATGCAGTCCAG

>Gene2_id1
ATGCATGCAGC
ACTAGGCCACG
CCGTAGCGGAC

>Gene1_id2
TAGCTAGCAGT
TAGCTAGCCGA

Each of these ~100 files contains ~20,000 of these genes. The problem is that my files are organized such that Gene1 IDs are mixed in alongside Gene2 IDs.

For my analysis, I need all of my Gene1 IDs organized in one place. I will ideally end up with one file for each GeneX, like so:

Desired Final Gene1 File:

>Gene1_id1
GATCGATCCGA
ATGCAGTCCAG

>Gene1_id2
TAGCTAGCAGT
TAGCTAGCCGA

The length of sequence varies between genes and between individuals within genes, so I need all the lines below a header line and above the next header line to be associated with the header.

My current solution has been to take each file and create a new file for each header in it. So the first file above creates three new files: one for >Gene1_id1, one for >Gene2_id1, and one for >Gene1_id2. From there, I was planning on re-organizing to suit my needs.

The problem with the above approach is that it has created ~800,000 similarly-named files which are killing my computer. There must be a better way.

Any advice on how to proceed? Thanks!!!

-Hannah

Top comments (7)

Molly Struve (she/her)

I made a couple of assumptions to come up with a way to split the files by gene.
Assumptions:

  • You are working with plain .txt files
  • You want the new organized genes in plain .txt files (there are a million other things you may want but I figured this was the simplest)

Rather than doing this in bash (because that is far from my strong suit), I wrote a simple Ruby script that takes the raw files from one folder, parses them by gene, and writes the new files to a new folder. With this you only end up with one file for each gene.

require 'fileutils'

path = 'genes'
FileUtils.mkdir_p('new_gene_files') # make sure the output folder exists

Dir.foreach(path) do |filename|
  next if filename == '.' || filename == '..' || filename == '.DS_Store'
  gene_file = nil
  header = true
  puts "working on #{filename}"

  file = File.open("#{path}/#{filename}", 'r')
  file.each_line do |line|
    if line.strip.empty?
      # a blank line separates records, so the next non-blank line is a header
      header = true
      gene_file&.close
    elsif header
      # ">Gene1_id1" -> "Gene1": drop the ">" and the "_idN" suffix so every
      # record for a gene is appended to the same file
      gene_name = line.gsub(/_id\d+/, '').delete('>').strip
      gene_file = File.open("new_gene_files/#{gene_name}.txt", 'a')
      gene_file.puts line
      header = false
    else
      gene_file&.puts line
    end
  end

  gene_file&.close # close the file for the last record too
  file.close
end
Philip

From what you wrote, I assume that you are looking for an intermediate data structure to be able to further process your data, is that correct?

The second question is: do you strictly want to stick to plain files?

And finally, is searching or retrieving files an issue for you? (Is it enough to know the gene and the id of a file, or do you want to be able to list all ids for gene X?)

If you want to stick to one file per gene and id, and the number of files inside one folder is an issue, you could create subfolders, either by gene name or, if that's still too many, broken down first by the first letter of the gene name, then a folder for the gene name, and finally the ids in there.
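A minimal sketch of that breakdown in Python (the by_gene folder, the .fasta extension, and the layout itself are just one possibility, not a prescription):

import os

def target_path(header, base='by_gene'):
    # map a header like '>Gene1_id2' to by_gene/G/Gene1/id2.fasta
    name = header.lstrip('>').strip()
    gene, seq_id = name.split('_', 1)
    folder = os.path.join(base, gene[0], gene)  # first letter, then gene name
    os.makedirs(folder, exist_ok=True)          # create the subfolders on demand
    return os.path.join(folder, seq_id + '.fasta')

print(target_path('>Gene1_id2'))  # by_gene/G/Gene1/id2.fasta

Most file systems are much happier with this kind of nesting than with ~800,000 files in one directory.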

If you want to be more flexible with what you do with your data, you can parse the data into a CSV file or an sqlite database. There, you can have one column for the gene, one for the id, and one for the actual genetic information. Then you can iterate over the rows (which is a bit more comfortable with sqlite) depending on your desired analysis.
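A minimal sketch of the sqlite route in Python (it assumes the plain-text FASTA files sit in a genes/ folder with a .fasta extension; the database, table, and column names are only examples):

import glob
import sqlite3

conn = sqlite3.connect('genes.db')
conn.execute('CREATE TABLE IF NOT EXISTS sequences (gene TEXT, seq_id TEXT, sequence TEXT)')

def save(header, chunks):
    # '>Gene1_id1' -> gene 'Gene1', seq_id 'id1'; the sequence lines are joined back together
    gene, seq_id = header.lstrip('>').split('_', 1)
    conn.execute('INSERT INTO sequences VALUES (?, ?, ?)', (gene, seq_id, ''.join(chunks)))

for path in glob.glob('genes/*.fasta'):
    with open(path) as handle:
        header, chunks = None, []
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                if header:
                    save(header, chunks)  # flush the previous record
                header, chunks = line, []
            elif line:
                chunks.append(line)
        if header:
            save(header, chunks)          # flush the last record in the file

conn.commit()
conn.close()

From there, something like SELECT * FROM sequences WHERE gene = 'Gene1' gives you every id for one gene in a single query, with no extra files on disk.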

If I got something wrong or you need to achieve something more with your data, I am happy to hear about it.

Robin Palotai

+1 for sqlite, will make your life easier down the line.

Peter Kim Frank

FYI, this is my sister 🙌.

Hope you get some helpful replies, Hannah!

Andy Zhao (he/him) • Edited

What's the file format of each gene file? If it's something like a .csv, you could probably run each file through a script that detects the header line and adds it to an array. Once you've run through all your files, you would have that array and could then write a new file like all_gene1.csv or something.

Since you have key-value pairs already with the header and the genetic data, you already have a good data structure that you can work off of. I think Ruby can handle this pretty well, but I'd have to play with the data in order to figure out if it's really possible, and you probably can't hand over the data so easily. 🙃

Edit: actually, the file format probably doesn't matter for Ruby. I think you can read line by line in Ruby regardless of file format.

Fernando B 🚀

Same with Python's read-all-lines approach. Though she said ~20K gene ids per file, I'm wondering how many lines each gene id has. If these genes are very large, I would run a Python or Ruby script that looks for Gene1 ids, adds them to an array, and then writes them to a new file. If memory becomes an issue, you will have to flush the array, say every 1M lines or whatever number your system can handle.

split_genes.py -n Gene1_* -l 1000000 -o Gene1.txt
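A rough sketch of that idea in Python (original_files/, the Gene1 pattern, and the flush threshold are all placeholders to adjust):

import glob

FLUSH_EVERY = 1_000_000  # lines to buffer before writing; tune to what your system can handle

buffer = []
with open('Gene1.txt', 'w') as out:
    for path in sorted(glob.glob('original_files/*.fasta')):
        keep = False
        with open(path) as handle:
            for line in handle:
                if line.startswith('>'):
                    keep = line.startswith('>Gene1_')  # only Gene1 records pass through
                if keep:
                    buffer.append(line)
                    if len(buffer) >= FLUSH_EVERY:
                        out.writelines(buffer)         # flush so memory stays bounded
                        buffer = []
    out.writelines(buffer)                             # write whatever is left over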
Hannah Frank

Thank you all! It's in a .fasta file, which is basically a plain text file. This is all really interesting -- I'll definitely look into sqlite and how to convert the data into a CSV so it's easier to work with. Also seems like a good time to brush up on my Python! :)