Discussion on: Advice on using a header in text to re-organize a large dataset


Andy Zhao (he/him) • Aug 14 '19 • Edited

What's the file format of each gene file? If it's something like a .csv, you could probably run each file through a script that detects the header line and adds the matching record to an array. Once you've run through all your files, you could write that array out to a new file, like all_gene1.csv or something.

Since you have key-value pairs already with the header and the genetic data, you already have a good data structure that you can work off of. I think Ruby can handle this pretty well, but I'd have to play with the data in order to figure out if it's really possible, and you probably can't hand over the data so easily. 🙃

Edit: actually, the file format probably doesn't matter for Ruby. I think you can read line by line in Ruby regardless of file format.
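Here is a rough sketch of the "detect the header, collect into an array" idea, written in Python rather than Ruby. The actual file layout isn't shown in the thread, so this assumes a header line starts with ">" and that the gene id is the text before the first "_" (for example ">Gene1_sampleA" gives "Gene1"). The function name group_by_gene is made up for illustration.

```python
from collections import defaultdict


def group_by_gene(lines):
    """Group lines under the gene id of the most recent header line.

    ASSUMPTION: headers start with ">" and the gene id is everything
    between ">" and the first "_". Adjust both to the real format.
    """
    groups = defaultdict(list)
    current = None  # gene id of the record we are currently inside
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:].split("_", 1)[0]
        if current is not None:
            groups[current].append(line)
    return groups


# Hypothetical usage: a file object yields one line at a time,
# so each input file can be passed straight in.
# with open("sample1.txt") as handle:
#     groups = group_by_gene(handle)
# with open("all_gene1.csv", "w") as out:
#     out.write("\n".join(groups["Gene1"]) + "\n")
```

This keeps everything in memory, which is fine for small inputs but may not scale to very large files.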

Fernando B 🚀 • Aug 14 '19

Same with Python's readlines method. She said 20K gene IDs, though, so I'm wondering how many lines each gene ID has. If these genes are very large, I would run a Python or Ruby script that looks for Gene1 IDs, adds the matching lines to an array, and then writes them to a new file. If memory becomes an issue, you will have to flush the array every 1M lines, say, or whatever number your system can handle.

split_genes.py -n Gene1_* -l 1000000 -o Gene1.txt
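A script like that doesn't exist yet, so here is a minimal sketch of what its core could look like. The three parameters mirror the -n, -l and -o flags in the command above. It assumes header lines start with ">" and that the pattern is a shell-style wildcard matched against the header text, both of which are guesses about the data; split_gene is a hypothetical name.

```python
import fnmatch


def split_gene(paths, pattern, out_path, flush_every=1_000_000):
    """Copy every record whose header matches `pattern` into `out_path`.

    ASSUMPTION: a record is a header line starting with ">" plus all
    lines up to the next header. Matching lines are buffered in a list
    and flushed to disk every `flush_every` lines to cap memory use.
    Returns the number of lines written.
    """
    buffer = []
    written = 0
    with open(out_path, "w") as out:
        for path in paths:
            keep = False  # a record never carries over between files
            with open(path) as handle:
                for line in handle:  # reads one line at a time
                    if not line.endswith("\n"):
                        line += "\n"  # last line may lack a newline
                    if line.startswith(">"):
                        keep = fnmatch.fnmatchcase(line[1:].strip(), pattern)
                    if keep:
                        buffer.append(line)
                        if len(buffer) >= flush_every:
                            out.writelines(buffer)
                            written += len(buffer)
                            buffer.clear()
        # write whatever is left after the last full batch
        out.writelines(buffer)
        written += len(buffer)
    return written
```

Something like split_gene(["a.txt", "b.txt"], "Gene1_*", "Gene1.txt") would then do the job for one gene. Because the input is read line by line, only the buffer has to fit in memory, and a smaller flush_every trades a few more writes for a lower peak.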