Discussion on: Advice on using a header in text to re-organize a large dataset

From what you wrote, I assume that you are looking for an intermediate data structure to be able to further process your data, is that correct?

The second question is: do you strictly want to stick to plain files?

And finally, is searching or retrieving of files an issue for you? (Is it enough to know the gene and the id of a file, or do you want to be able to list all ids for gene x?)

If you want to stick to one file per gene and id, and the number of files inside one folder is an issue, you could create subfolders: either one per gene name or, if that's still too many, a breakdown by the first letter of the gene name first, then a folder with the gene name, and finally the ids in there.
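A minimal sketch of that folder layout in Python, in case it helps. The names `record_path` and `write_record`, the `data` root folder, and the `.txt` extension are just assumptions for illustration; adapt them to whatever your files actually look like.

```python
from pathlib import Path


def record_path(root, gene, record_id):
    """Build root/<first letter>/<gene>/<id>.txt without touching the disk."""
    # The first letter is upper-cased so "brca1" and "BRCA1" share a bucket.
    return Path(root) / gene[0].upper() / gene / f"{record_id}.txt"


def write_record(root, gene, record_id, text):
    """Write one record to its nested location, creating folders as needed."""
    path = record_path(root, gene, record_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path
```

With this, `write_record("data", "geneX", "id1", "...")` would end up in `data/G/geneX/id1.txt`, and listing all ids for a gene is a matter of listing one folder.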

If you want to be more flexible with what you do with your data, you can parse the data into a CSV file or an SQLite database. There, you can have one column for the gene, one for the id, and one for the actual genetic information. Then you can iterate over the rows (which is a bit more comfortable with SQLite) depending on your desired analysis.
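Here is a rough sketch of the SQLite variant using Python's built-in `sqlite3` module. The table name `records`, the column names, and the helper functions are my own placeholders, and I am assuming that a gene plus an id uniquely identifies one entry.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        " gene TEXT NOT NULL,"
        " id TEXT NOT NULL,"
        " data TEXT NOT NULL,"
        " PRIMARY KEY (gene, id))"  # assumption: gene + id is unique
    )
    return conn


def add_record(conn, gene, record_id, data):
    """Insert one parsed entry; the with-block commits the transaction."""
    with conn:
        conn.execute(
            "INSERT INTO records (gene, id, data) VALUES (?, ?, ?)",
            (gene, record_id, data),
        )


def ids_for_gene(conn, gene):
    """Return all ids stored for one gene, sorted alphabetically."""
    rows = conn.execute(
        "SELECT id FROM records WHERE gene = ? ORDER BY id", (gene,)
    )
    return [row[0] for row in rows]
```

Pass a file name such as `open_db("genes.sqlite")` to keep the data on disk; the default `":memory:"` is only handy for trying things out. Listing all ids for gene x is then a single query instead of a directory walk.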

If I got something wrong or you need to achieve something more with your data, I am happy to hear about it.

Robin Palotai • Aug 14 '19

+1 for sqlite, will make your life easier down the line.