
How To Extract Dataframe By Row Values By Conditions With Other Columns?

I have a dataframe as follows: #values a=['003C', '003P1', '003P1', '003P1', '004C', '004P1', '004P2', '003C', '003P2', '003P1', '003C', '003P1', '003P2', '003C', '003P1', '004C',

Solution 1:

Solution

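# Grouping keys: variant coordinates plus the numeric prefix of Sample
c = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
# Split Sample into its leading digits (INT) and the remaining suffix (STR)
df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')

# m: the suffix is one of the allowed values
m  = df['STR'].isin(['C', 'P1', 'P2'])
# m1: the row's group contains at least one 'C' sample
m1 = df['STR'].eq('C').groupby([*df[c].values.T]).transform('any')
# m2: the row's group contains at least two distinct allowed suffixes
m2 = df['STR'].mask(~m).groupby([*df[c].values.T]).transform('nunique').ge(2)

# Keep rows that pass all three masks, sort by position, drop the helper columns
df = df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(columns=['INT', 'STR'])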

Explanations

Split the Sample column into two new columns with str.extract and the regex (\d+)(.*): INT receives the leading digits and STR receives the remaining suffix

>>> df[['INT','STR']]

    INT STR
0   003   C
1   003  P1
2   003  P1
3   003  P1
4   004   C
5   004  P1
6   004  P2
7   003   C
8   003  P2
9   003  P1
10  003   C
11  003  P1
12  003  P2
13  003   C
14  003  P1
15  004   C
16  004  P2
17  001   C
18  001  P1
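
As a minimal sketch of this step on its own, the snippet below applies the same regex to a few hypothetical sample names. The names `samples` and `parts` are illustrative and do not appear in the original dataframe. The first capture group takes the leading digits and the second takes whatever follows.

```python
import pandas as pd

# Hypothetical sample names, shaped like the ones in the question
samples = pd.Series(['003C', '003P1', '004P2', '001C'], name='Sample')

# Two capture groups produce a two-column DataFrame (columns 0 and 1)
parts = samples.str.extract(r'(\d+)(.*)')
parts.columns = ['INT', 'STR']

print(parts)
```

Because the digits are extracted as strings, leading zeros such as 003 are preserved.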

Create a boolean mask m with isin that is True wherever the extracted column STR is one of the values C, P1 or P2

>>> m

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
15    True
16    True
17    True
18    True
Name: STR, dtype: bool
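
In the data above every suffix is allowed, so m is True everywhere. A small sketch with a hypothetical extra suffix (P3, which is not in the original data) shows what the mask would reject:

```python
import pandas as pd

# Hypothetical suffixes; 'P3' is included only to show a rejected value
suffix = pd.Series(['C', 'P1', 'P3', 'P2'], name='STR')

# True where the suffix is one of the allowed values
m = suffix.isin(['C', 'P1', 'P2'])

print(m.tolist())  # [True, True, False, True]
```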

Compare the STR column with C to get a boolean Series, group it by the columns ['CHROM', 'POS', 'REF', 'ALT', 'INT'] and transform with any. The resulting mask m1 is True for every row whose group contains at least one C sample

>>> m1
0      True
1     False
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
Name: STR, dtype: bool
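
The group-wise any can be sketched on a smaller, hypothetical frame. The two key columns and their values below are assumptions chosen for brevity, and the keys are passed as a list of Series rather than the unpacked array used in the solution.

```python
import pandas as pd

# Hypothetical data: two groups, only the first one contains a 'C' sample
df = pd.DataFrame({
    'POS': [100, 100, 200, 200],
    'INT': ['003', '003', '003', '003'],
    'STR': ['C', 'P1', 'P1', 'P2'],
})

keys = [df['POS'], df['INT']]

# transform('any') broadcasts "does this group contain a C?" back to every row
m1 = df['STR'].eq('C').groupby(keys).transform('any')

print(m1.tolist())  # [True, True, False, False]
```

Using transform rather than an aggregation keeps the result aligned with the original index, so it can be used directly as a row filter.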

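Mask the values in column STR where the boolean mask m is False (replacing them with NaN), then group this masked column by ['CHROM', 'POS', 'REF', 'ALT', 'INT'] and transform using nunique. Chaining ge(2) gives the boolean mask m2, which is True for every row whose group contains at least two distinct allowed suffixes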

>>> m2

0     False
1     False
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
Name: STR, dtype: bool
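
A minimal sketch of the mask-then-count step, again on hypothetical data (the single key column POS and the suffix X are assumptions for illustration). Values outside the allowed set become NaN, and nunique ignores NaN by default, so they do not count toward the number of distinct suffixes.

```python
import pandas as pd

# Hypothetical data: three groups keyed by POS
df = pd.DataFrame({
    'POS': [1, 1, 2, 2, 3, 3],
    'STR': ['C', 'P1', 'C', 'C', 'C', 'X'],
})

m = df['STR'].isin(['C', 'P1', 'P2'])

# Replace disallowed suffixes with NaN so they are not counted
masked = df['STR'].mask(~m)

# Number of distinct allowed suffixes per group, broadcast to every row
m2 = masked.groupby(df['POS']).transform('nunique').ge(2)

print(m2.tolist())  # [True, True, False, False, False, False]
```

Only the first group has two distinct allowed suffixes; the second has a repeated C and the third has a C plus a disallowed value.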

Now take the logical and of the masks m, m1 and m2, and use this to filter the required rows in the dataframe

>>> df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(columns=['INT', 'STR'])

   Sample  CHROM        POS REF ALT
0    003C   chr1     125895   T   A
1   003P1   chr1     125895   T   A
2    004C  chr11    1163940   C   G
3   004P1  chr11    1163940   C   G
4   004P2  chr11    1163940   C   G
5    004C  chr11    2587895   C   G
6   004P2  chr11    2587895   C   G
7    003C  chr11    5986513   G   A
8   003P2  chr11    5986513   G   A
9   003P1  chr11    5986513   G   A
10   001C   chr9   14587952   T   C
11  001P1   chr9   14587952   T   C
12   003C   chr1  248650751   T   A
13  003P1   chr1  248650751   T   A
14  003P2   chr1  248650751   T   A
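
Since the question's data is truncated above, the following is a self-contained sketch of the whole pipeline on a small hypothetical dataframe (all values are invented for illustration). It passes the helper columns to drop by keyword, because the positional axis argument is not accepted by recent pandas releases.

```python
import pandas as pd

# Hypothetical data: three variant groups with different suffix combinations
df = pd.DataFrame({
    'Sample': ['003C', '003P1', '004C', '004C', '005P1', '005P2'],
    'CHROM':  ['chr1', 'chr1', 'chr2', 'chr2', 'chr3', 'chr3'],
    'POS':    [200, 200, 100, 100, 300, 300],
    'REF':    ['T', 'T', 'C', 'C', 'G', 'G'],
    'ALT':    ['A', 'A', 'G', 'G', 'A', 'A'],
})

df[['INT', 'STR']] = df['Sample'].str.extract(r'(\d+)(.*)')
keys = [df[k] for k in ['CHROM', 'POS', 'REF', 'ALT', 'INT']]

m = df['STR'].isin(['C', 'P1', 'P2'])                                # allowed suffix
m1 = df['STR'].eq('C').groupby(keys).transform('any')               # group has a C
m2 = df['STR'].mask(~m).groupby(keys).transform('nunique').ge(2)    # >= 2 distinct suffixes

result = (
    df[m & m1 & m2]
    .sort_values('POS', ignore_index=True)
    .drop(columns=['INT', 'STR'])
)
print(result)
```

Only the 003 group survives here: the 004 group has a C but no second distinct suffix, and the 005 group has two suffixes but no C.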
