Subset Recursively A Data.frame
I have a data frame with close to 4 million rows in it. I need an efficient way to subset the data based on two criteria. I can do this in a for loop but was wondering if t
Solution 1:
Here is a partial solution in R using data.table, which is probably the fastest way to go in R when dealing with large datasets.
library(data.table) # v1.9.7 (devel version)
df <- fread("C:/folderpath/data.csv") # load your data
setDT(df) # convert your dataset into data.table
1st step
# Filter data under threshold 0.05 and sort by CHR, POS
df <- df[ P < 0.05, ][order(CHR, POS)]
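To see what this first step does, here is a minimal sketch on a toy table. The column names (SNP, CHR, POS, P) follow the code above; the values are invented for illustration only.

```r
library(data.table)

# Toy data: column names follow the answer, values are made up
toy <- data.table(
  SNP = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  CHR = c(2, 1, 1, 2, 1),
  POS = c(300, 200, 100, 150, 400),
  P   = c(0.01, 0.20, 0.03, 0.04, 0.001)
)

# Keep rows with P < 0.05, then order by chromosome and position
toy_sig <- toy[P < 0.05][order(CHR, POS)]
print(toy_sig)
```

Chaining the two `[` calls filters first and sorts second, so the sort only touches the rows that survive the threshold.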
2nd step
df[, {idx = (1:.N)[which.min(P)]
SNP[seq(max(1, idx - 5e5), min(.N, idx + 5e5))]}, by = CHR]
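As a rough illustration of the second step, the sketch below runs the same grouped expression on a toy table. The window size `w` is a hypothetical stand-in for the 5e5 used above, and the data are invented. Note that the window is counted in rows around the minimum-P row within each CHR, not in POS units.

```r
library(data.table)

# Toy data, already sorted by CHR and POS; values are made up
toy <- data.table(
  SNP = paste0("rs", 1:8),
  CHR = rep(c(1, 2), each = 4),
  POS = c(100, 200, 300, 400, 50, 60, 70, 80),
  P   = c(0.04, 0.001, 0.03, 0.02, 0.01, 0.02, 0.03, 0.0005)
)

w <- 1  # hypothetical window size in rows (the answer uses 5e5)

res <- toy[, {
  idx <- which.min(P)  # row of the smallest P within this CHR
  # SNPs in rows idx - w .. idx + w, clipped to the group's bounds
  list(SNP = SNP[seq(max(1, idx - w), min(.N, idx + w))])
}, by = CHR]

print(res)
```

If the window is meant to be a distance in base pairs, the row-based `seq()` would need to be replaced by a condition on POS; that variant is not shown here.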
Saving output in different files
df[, fwrite(copy(.SD)[, SNP := SNP], paste0("output", SNP,".csv")), by = SNP]
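A small sketch of the per-group write, assuming a toy table and a hypothetical output folder under `tempdir()`. It uses `.BY` for the file name and writes `.SD` directly, so unlike the line above the files do not contain the SNP column itself.

```r
library(data.table)

# Toy data; values are made up
toy <- data.table(
  SNP = c("rs1", "rs1", "rs2"),
  CHR = c(1, 1, 2),
  P   = c(0.01, 0.02, 0.03)
)

out_dir <- file.path(tempdir(), "snp_output")  # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)

# One file per SNP: .BY holds the current group's value,
# .SD holds the group's remaining columns
toy[, fwrite(.SD, file.path(out_dir, paste0("output", .BY$SNP, ".csv"))), by = SNP]

list.files(out_dir)
```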
P.S. Note that this answer uses fwrite, which is still in the development version of data.table. See the data.table installation instructions for how to install it. You could simply use write.csv; however, you're dealing with a big dataset, so speed is quite important, and fwrite is certainly one of the fastest alternatives.