How To Extract Dataframe By Row Values By Conditions With Other Columns?
I have a dataframe as follows: #values a=['003C', '003P1', '003P1', '003P1', '004C', '004P1', '004P2', '003C', '003P2', '003P1', '003C', '003P1', '003P2', '003C', '003P1', '004C',
Solution 1:
Solution
c = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')
m = df['STR'].isin(['C', 'P1', 'P2'])
m1 = df['STR'].eq('C').groupby([*df[c].values.T]).transform('any')
m2 = df['STR'].mask(~m).groupby([*df[c].values.T]).transform('nunique').ge(2)
df = df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(columns=['INT', 'STR'])
Explanations
Extract the columns INT and STR by using str.extract with a regex pattern: (\d+) captures the leading digits of Sample and (.*) captures whatever follows them
>>> df[['INT','STR']]
INT STR
0 003 C
1 003 P1
2 003 P1
3 003 P1
4 004 C
5 004 P1
6 004 P2
7 003 C
8 003 P2
9 003 P1
10 003 C
11 003 P1
12 003 P2
13 003 C
14 003 P1
15 004 C
16 004 P2
17 001 C
18 001 P1
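As a minimal sketch of this extraction step on its own, the snippet below applies the same pattern to a handful of sample labels. The short `samples` series is a hypothetical stand-in, not the full data from the question.

```python
import pandas as pd

# Hypothetical sample labels (a small stand-in for the Sample column)
samples = pd.Series(['003C', '003P1', '004P2', '001C'], name='Sample')

# (\d+) captures the leading digits, (.*) captures the remaining suffix
parts = samples.str.extract(r'(\d+)(.*)')
parts.columns = ['INT', 'STR']

print(parts)
```

Because both groups are unnamed, str.extract returns columns labelled 0 and 1, which is why they are renamed here (the solution does the same by assigning into df[['INT','STR']]).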
Create a boolean mask using isin to check for the condition where the extracted column STR contains only the values C, P1 and P2
>>> m
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
Name: STR, dtype: bool
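Every row passes this check in the data above, so it may help to see a case where it does not. The following sketch uses a hypothetical suffix 'P3' (not present in the question's data) to show what isin filters out.

```python
import pandas as pd

# Hypothetical suffixes; 'P3' is invented here to show a failing row
suffix = pd.Series(['C', 'P1', 'P3', 'P2'], name='STR')

# True only where the suffix is one of the allowed values
allowed = suffix.isin(['C', 'P1', 'P2'])

print(allowed.tolist())
```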
Compare the STR column with C to create a boolean mask, then group this mask on the columns ['CHROM', 'POS', 'REF', 'ALT', 'INT'] and transform using any to create a boolean mask m1, which is True for every row whose group contains at least one C sample
>>> m1
0 True
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
Name: STR, dtype: bool
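The group-wise any can be sketched on a toy frame as follows. The frame and its values are hypothetical, and only two grouping keys are used to keep it short; passing a list of Series to groupby is assumed here as an equivalent of the [*df[c].values.T] form in the solution.

```python
import pandas as pd

# Hypothetical toy data: two variant positions for one individual
toy = pd.DataFrame({
    'POS': [100, 100, 200, 200],
    'INT': ['003', '003', '003', '003'],
    'STR': ['C', 'P1', 'P1', 'P2'],
})

# For each (POS, INT) group, broadcast "does any row have STR == 'C'?"
has_c = toy['STR'].eq('C').groupby([toy['POS'], toy['INT']]).transform('any')

print(has_c.tolist())
```

The group at POS 100 contains a C row, so both of its rows are flagged; the group at POS 200 has none, so neither is.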
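Mask the values in column STR where the boolean mask m is False (they become NaN, which nunique ignores), then group this masked column by ['CHROM', 'POS', 'REF', 'ALT', 'INT'] and transform using nunique, then chain with ge(2) to create a boolean mask m2 that is True where the group has at least two distinct allowed values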
>>> m2
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
Name: STR, dtype: bool
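A small sketch of the mask, nunique and ge chain, under the assumption of a toy frame with one grouping key and an invented suffix 'X' that is not in the allowed list:

```python
import pandas as pd

# Hypothetical toy data; 'X' is an invented, disallowed suffix
toy = pd.DataFrame({
    'POS': [100, 100, 200, 200, 300, 300],
    'STR': ['C', 'P1', 'C', 'C', 'C', 'X'],
})

allowed = toy['STR'].isin(['C', 'P1', 'P2'])

# Disallowed values become NaN so that nunique does not count them
masked = toy['STR'].mask(~allowed)

# Keep groups with at least two distinct allowed suffixes
enough = masked.groupby(toy['POS']).transform('nunique').ge(2)

print(enough.tolist())
```

Only POS 100 has two distinct allowed suffixes. POS 200 repeats C, and POS 300 has C plus a masked value, so both count as one.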
Now take the logical and of the masks m, m1 and m2, and use this to filter the required rows in the dataframe
>>> df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(columns=['INT', 'STR'])
Sample CHROM POS REF ALT
0 003C chr1 125895 T A
1 003P1 chr1 125895 T A
2 004C chr11 1163940 C G
3 004P1 chr11 1163940 C G
4 004P2 chr11 1163940 C G
5 004C chr11 2587895 C G
6 004P2 chr11 2587895 C G
7 003C chr11 5986513 G A
8 003P2 chr11 5986513 G A
9 003P1 chr11 5986513 G A
10 001C chr9 14587952 T C
11 001P1 chr9 14587952 T C
12 003C chr1 248650751 T A
13 003P1 chr1 248650751 T A
14 003P2 chr1 248650751 T A
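For reference, here is the whole approach as one self-contained sketch on a small hypothetical frame (the rows are invented for illustration, not taken from the question). It uses drop(columns=...) because pandas 2.0 and later no longer accept the axis as a positional argument, and it groups by a list of Series, which is assumed to be equivalent to the array-based grouping used above.

```python
import pandas as pd

# Hypothetical data: one complete group, one without C, one with C only
df = pd.DataFrame({
    'Sample': ['003C', '003P1', '004P1', '004P2', '005C'],
    'CHROM':  ['chr1', 'chr1', 'chr2', 'chr2', 'chr3'],
    'POS':    [200, 200, 100, 100, 300],
    'REF':    ['T', 'T', 'C', 'C', 'G'],
    'ALT':    ['A', 'A', 'G', 'G', 'A'],
})

keys = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
df[['INT', 'STR']] = df['Sample'].str.extract(r'(\d+)(.*)')
groups = [df[k] for k in keys]

m = df['STR'].isin(['C', 'P1', 'P2'])                       # allowed suffix
m1 = df['STR'].eq('C').groupby(groups).transform('any')     # group has a C
m2 = (df['STR'].mask(~m).groupby(groups)
      .transform('nunique').ge(2))                          # >= 2 distinct

result = (df[m & m1 & m2]
          .sort_values('POS', ignore_index=True)
          .drop(columns=['INT', 'STR']))

print(result)
```

Only the 003 group survives: the 004 group has no C sample and the 005 group has a C sample but no P1 or P2 alongside it.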