
Subset Recursively A Data.frame

I have a data frame with close to 4 million rows in it. I need an efficient way to subset the data based on two criteria. I can do this in a for loop but was wondering if t

Solution 1:

Here is a partial solution in R using data.table, which is probably the fastest way to go in R when dealing with large datasets.

library(data.table) # v1.9.7 (devel version)
df <- fread("C:/folderpath/data.csv") # load your data
setDT(df) # make sure df is a data.table (fread already returns one)

1st step

# Filter data under threshold 0.05 and sort by CHR, POS
df <- df[ P < 0.05, ][order(CHR, POS)]
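To see what the first step does, here is a minimal sketch on made-up data. The column names SNP, CHR, POS and P follow the answer; the values and the names `toy` and `filtered` are assumptions for illustration only.

```r
library(data.table)

# Toy data: column names follow the answer, values are invented.
toy <- data.table(
  SNP = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  CHR = c(2, 1, 1, 2, 1),
  POS = c(300, 200, 100, 100, 300),
  P   = c(0.01, 0.20, 0.03, 0.04, 0.001)
)

# Keep rows with P below 0.05, then sort by chromosome and position.
filtered <- toy[P < 0.05][order(CHR, POS)]
print(filtered)
```

Chaining the two `[` calls keeps the filter and the sort as separate, readable steps.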

2nd step

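df[, {idx <- which.min(P)  # row index of the smallest P value within this CHR
      # SNPs from up to 5e5 rows before that row to 5e5 rows after it
      SNP[seq(max(1, idx - 5e5), min(.N, idx + 5e5))]}, by = CHR]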
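Note that the window in the second step is counted in rows around the row with the smallest P value, not in units of POS. The sketch below runs the same idea on made-up data with a small window, and then shows a variant that measures the window in POS instead. The POS-based variant is an assumption about what might be wanted, since the question's criteria are cut off here. `toy`, `k`, `w`, `by_rows` and `by_pos` are hypothetical names.

```r
library(data.table)

# Toy data, already sorted by CHR and POS; values are invented.
toy <- data.table(
  SNP = paste0("rs", 1:6),
  CHR = c(1, 1, 1, 1, 2, 2),
  POS = c(100, 200, 250, 900, 50, 5000),
  P   = c(0.04, 0.001, 0.02, 0.03, 0.01, 0.002)
)

k <- 1  # hypothetical half-width, counted in rows

# Row-based window, mirroring the answer's second step.
by_rows <- toy[, {
  idx <- which.min(P)
  list(SNP = SNP[seq(max(1, idx - k), min(.N, idx + k))])
}, by = CHR]

w <- 100  # hypothetical half-width, in the units of POS

# Position-based window (assumption): keep SNPs whose POS lies
# within w of the position of the smallest P value.
by_pos <- toy[, {
  lead_pos <- POS[which.min(P)]
  list(SNP = SNP[abs(POS - lead_pos) <= w])
}, by = CHR]

print(by_rows)
print(by_pos)
```

On chromosome 2 the two variants differ: the row-based window keeps both SNPs, while the position-based one keeps only the SNP at POS 5000.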

Saving output in different files

df[, fwrite(copy(.SD)[, SNP := SNP], paste0("output", SNP,".csv")), by = SNP]
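An alternative way to write one file per SNP is to split the table first and loop over the pieces. This is only a sketch on made-up data, not the answer's method. The output folder `out_dir` and the names `toy`, `pieces` and `written` are hypothetical, and it assumes a data.table version that provides `split()` with a `by` argument and `fwrite`.

```r
library(data.table)

# Toy data; values are invented.
toy <- data.table(
  SNP = c("rs1", "rs2", "rs1"),
  CHR = c(1, 1, 2),
  P   = c(0.01, 0.02, 0.03)
)

# Hypothetical output folder inside the session's temporary directory.
out_dir <- file.path(tempdir(), "snp_output")
dir.create(out_dir, showWarnings = FALSE)

# split() on a data.table returns a named list with one table per SNP.
pieces <- split(toy, by = "SNP")
for (snp in names(pieces)) {
  fwrite(pieces[[snp]], file.path(out_dir, paste0("output", snp, ".csv")))
}

written <- sort(list.files(out_dir))
print(written)
```

Splitting first keeps the SNP column in each piece, so there is no need to add it back with `copy(.SD)`.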

P.S. Note that this answer uses fwrite, which at the time of writing was only available in the development version of data.table; see the data.table installation instructions for how to get it. You could simply use write.csv, but with a dataset this large speed matters, and fwrite is one of the fastest alternatives.
