Hi Dev Community,
I have over 100 files in the following structure (called a FASTA -- very familiar to those of you who spend time looking at genetic data):
Original File 1:
Each of these ~100 files contains ~20,000 of these genes. The problem is that my files are organized such that Gene1 IDs are blended alongside the IDs of other genes.
For my analysis, I need all of my Gene1 IDs organized in one place. I will ideally end up with one file for each GeneX, like so:
The length of sequence varies between genes and between individuals within genes, so I need all the lines below a header line and above the next header line to be associated with the header.
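To make that rule concrete, here is a minimal Python sketch that reads a FASTA file one record at a time, assuming every header line starts with `>`. The function name `read_fasta` and the sample records are hypothetical and only there for illustration.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence_lines) pairs from an open FASTA text handle."""
    header = None
    lines = []
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, lines
            header = line[1:].strip()
            lines = []
        elif header is not None and line.strip():
            # Every non-empty line up to the next header belongs to this record.
            lines.append(line.strip())
    # Emit the final record, which has no following header to close it.
    if header is not None:
        yield header, lines

# Hypothetical sample data, not taken from the real files.
sample = io.StringIO(
    ">Gene1_id1\nATGC\nGGTA\n>Gene2_id1\nTTAA\n>Gene1_id2\nATGCC\n"
)
records = list(read_fasta(sample))
```

Because the sequence lines stay attached to their header, records of any length can be moved around as a unit.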
My current solution has been to take each file and create a new file for each header line. So the first file creates three new files: one for >Gene1_id1, one for >Gene2_id1, and one for >Gene1_id2. From there, I was planning on re-organizing to suit my needs.
The problem with the above approach is that it has created ~800,000 similarly named files, which are killing my computer. There must be a better way.
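One possible direction, sketched in Python under assumptions: group the records of each input file by gene in memory, then append each group to a single per-gene output file. This assumes headers look like `>Gene1_id1`, so that the text before the first underscore is the gene name. The function name `split_by_gene`, the `.fasta` output naming, and the sample data are all hypothetical.

```python
import os
import tempfile
from collections import defaultdict

def split_by_gene(input_paths, out_dir):
    """Append each record to out_dir/<gene>.fasta, one output file per gene."""
    os.makedirs(out_dir, exist_ok=True)
    for path in input_paths:
        grouped = defaultdict(list)  # gene name -> lines collected from this file
        gene = None
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # Assumption: the gene name is the header text before the first "_".
                    gene = line[1:].split("_", 1)[0].strip()
                if gene is not None:
                    # Keep header and sequence lines together, each newline-terminated.
                    grouped[gene].append(line if line.endswith("\n") else line + "\n")
        # Append mode lets records from later input files join the same gene file.
        for name, lines in grouped.items():
            with open(os.path.join(out_dir, name + ".fasta"), "a") as out:
                out.writelines(lines)

# Small demonstration on hypothetical data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "file1.fasta")
    with open(src, "w") as f:
        f.write(">Gene1_id1\nATGC\n>Gene2_id1\nTTAA\n>Gene1_id2\nATGCC\n")
    out_dir = os.path.join(tmp, "by_gene")
    split_by_gene([src], out_dir)
    created = sorted(os.listdir(out_dir))
    with open(os.path.join(out_dir, "Gene1.fasta")) as f:
        gene1_text = f.read()
```

With this layout the number of output files tracks the number of distinct genes rather than the number of headers, and only one output file is open at any moment. Since the files are opened in append mode, the output directory should be empty before a rerun.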
Any advice on how to proceed? Thanks!!!