Categorize CSV Files Based on $18 Info and Find the CSV File in Each Category Which Has the Largest Unique Number in $4
Solution 1:
Updated Answer
In the light of your comments, I have had a try at reworking the awk
to match your new needs. I would probably encode them something like this:
#!/bin/bash

# Do enzymatic/biochemical first
for f in *.csv; do
    awk -F, -v IGNORECASE=1 'NR>1 && ($18 ~ "enzymatic" || $18 ~ "biochemical") && $12<=10 {print $12,FILENAME}' "$f"
done | sort -n | tail -3

# Now do cell types
for f in *.csv; do
    awk -F, -v IGNORECASE=1 'NR>1 && $18 ~ "cell" && $12<=10 {print $12,FILENAME}' "$f"
done | sort -n | tail -3
However, I think the following may be more efficient and easier:
egrep -Hi "enzyme|biochemical" *.csv | awk -F, '$12<=10 {split($1,a,":"); filename=a[1]; print filename,$12}' | sort -k2n | tail -3

grep -Hi "cell" *.csv | awk -F, '$12<=10 {split($1,a,":"); filename=a[1]; print filename,$12}' | sort -k2n | tail -3
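If you want to avoid repeating the pipeline for each category, you could also wrap it in a small shell function, something like this (a minimal sketch of my own; the function name top3 is not from the original answer):

#!/bin/bash
# top3 PATTERN - print the 3 largest column-12 values (<=10) and their files,
# for lines in any *.csv that match PATTERN case-insensitively
top3() {
    grep -EHi "$1" *.csv |
        awk -F, '$12<=10 {split($1,a,":"); print a[1], $12}' |
        sort -k2n | tail -3
}

top3 "enzyme|biochemical"   # enzymatic/biochemical category
top3 "cell"                 # cell category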
Original Answer
I think this is what you mean!
#!/bin/bash

for f in *.csv; do
    res=$(awk -F',' '
        BEGIN {IGNORECASE=1; field18ok=0}
        $18 ~ "enzymatic" || $18 ~ "biochemical" || $18 ~ "cell" {field18ok=1}
        NR>1 {if (!col4[$4]++) u++}
        END {print field18ok * u}' "$f")
    echo "$res:$f"
done | sort -n
It cycles through all .csv files, and passes them one at a time into awk.
If any line has one of your 3 keywords (upper or lower case) in field 18, it sets a flag to say field 18 is OK and is one of the ones you are looking for. If field 18 never matches, the variable field18ok stays at zero, which makes the answer printed at the end equal to zero.
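As a tiny standalone illustration of that flag pattern (a toy example of my own, using field 2 instead of field 18):

printf 'a,enzymatic\nb,other\n' | awk -F, 'BEGIN{ok=0} $2 ~ "enzymatic" {ok=1} END{print ok}'   # prints 1
printf 'a,foo\nb,other\n'       | awk -F, 'BEGIN{ok=0} $2 ~ "enzymatic" {ok=1} END{print ok}'   # prints 0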
The next part, starting NR>1, applies only to lines where the line number is greater than one, so it basically ignores the header line of the input file. It then counts unique values in column 4 by remembering every value it has already seen in column 4 in an array called col4[]. The first time a value is added to this array, it increments u (the number of unique things seen in field 4).
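The !col4[$4]++ test is a common awk idiom: the post-increment returns the old count, so the expression is true only the first time a given value appears. A toy demonstration of my own, counting unique values in field 1:

printf 'x\ny\nx\n' | awk '{if (!seen[$1]++) u++} END{print u}'   # prints 2 (x and y)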
At the end (END{}), it multiplies field18ok by the number of unique compounds in column 4. So, if field 18 is not one that you want, the answer will be zero, whereas if field 18 is one of the values you are looking for, it will be the number of unique values in field 4. The output is then sorted numerically so you can pick the highest value easily.
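If you only want the single best file printed, changing the last line of the script to done | sort -n | tail -1 keeps just the file with the most unique compounds (my addition, not in the original answer).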
Solution 2:
This code reads each file's data, takes the value at the 18th position (index 17, because indexing is zero-based) and, if it matches one of the wanted values, adds the compound to a dict keyed by filename. I've used a set because that structure doesn't store duplicate values. Finally, you only have to check each set to see which file has the most unique values.
import csv

files = ['Test_95_target_1334_assay_Detail3.csv',
         'Test_95_target_1338_assay_Detail3.csv',
         'Test_95_target_2888_assay_Detail3.csv']

pos_to_check = 17   # zero-based index of column 18
pos_compound = 3    # zero-based index of column 4
values_to_check = ["enzymatic", "biochemical", "cell"]

result = dict((file, set()) for file in files)   # file : set of compounds

for file in files:
    with open(file, 'r') as csvfile:
        csvreader = csv.reader(csvfile)
        for row in csvreader:
            if row[pos_to_check].lower() in values_to_check:
                result[file].add(row[pos_compound])

# get the key (file) whose set has the most elements
print(max(result, key=lambda key: len(result[key])))
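Note that max returns only the winning filename; if you also want the counts per file, something like sorted(result.items(), key=lambda kv: len(kv[1])) would give all files in ascending order of unique compounds (my suggestion, not part of the original answer).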