Categorize CSV Files Based on $18 Info and Find the CSV File in Each Category Which Has the Largest Unique Number in $4
Solution 1:
Updated Answer
In the light of your comments, I have had a try at reworking the awk
to match your new needs. I would probably encode them something like this:
#!/bin/bash

# Do enzymatic/biochemical first
for f in *.csv; do
    awk -F, -v IGNORECASE=1 'NR>1 && ($18 ~ "enzymatic" || $18 ~ "biochemical") && $12<=10 {print $12,FILENAME}' "$f"
done | sort -n | tail -3

# Now do cell types
for f in *.csv; do
    awk -F, -v IGNORECASE=1 'NR>1 && $18 ~ "cell" && $12<=10 {print $12,FILENAME}' "$f"
done | sort -n | tail -3
However, I think the following may be more efficient and easier:
egrep -Hi "enzyme|biochemical" *.csv | awk -F, '$12<=10 {split($1,a,":"); filename=a[1]; print filename,$12}' | sort -k2n | tail -3

grep -Hi "cell" *.csv | awk -F, '$12<=10 {split($1,a,":"); filename=a[1]; print filename,$12}' | sort -k2n | tail -3
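If you want to avoid repeating the pipeline for each category, you could also wrap it in a small shell function, something like this (a minimal sketch of my own; the function name top3 is not from the original answer):

#!/bin/bash
# top3 PATTERN - print the 3 largest column-12 values (<=10) and their files,
# for lines in any *.csv that match PATTERN case-insensitively
top3() {
    grep -EHi "$1" *.csv |
        awk -F, '$12<=10 {split($1,a,":"); print a[1], $12}' |
        sort -k2n | tail -3
}

top3 "enzyme|biochemical"   # enzymatic/biochemical category
top3 "cell"                 # cell category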
Original Answer
I think this is what you mean!
#!/bin/bash

for f in *.csv; do
    res=$(awk -F',' '
        BEGIN {IGNORECASE=1; field18ok=0}
        $18 ~ "enzymatic" || $18 ~ "biochemical" || $18 ~ "cell" {field18ok=1}
        NR>1 {if (!col4[$4]++) u++}
        END {print field18ok * u}' "$f")
    echo "$res:$f"
done | sort -n
It cycles through all .csv files, and passes them one at a time into awk.
If any line has one of your 3 keywords (upper or lower case) in field 18, it sets a flag to say field 18 is OK and is one of the ones you are looking for. If field 18 never matches, the variable field18ok stays at zero, which makes the answer printed at the end equal to zero.
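As a tiny standalone illustration of that flag pattern (a toy example of my own, using field 2 instead of field 18):

printf 'a,enzymatic\nb,other\n' | awk -F, 'BEGIN{ok=0} $2 ~ "enzymatic" {ok=1} END{print ok}'   # prints 1
printf 'a,foo\nb,other\n'       | awk -F, 'BEGIN{ok=0} $2 ~ "enzymatic" {ok=1} END{print ok}'   # prints 0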
The next part, starting NR>1, applies only to lines where the line number is greater than one, so it basically ignores the header line of the input file. It then counts unique values in column 4 by remembering every value it has already seen in column 4 in an array called col4[]. The first time a value is added to this array, it increments u (the number of unique things seen in field 4).
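The !col4[$4]++ test is a common awk idiom: the post-increment returns the old count, so the expression is true only the first time a given value appears. A toy demonstration of my own, counting unique values in field 1:

printf 'x\ny\nx\n' | awk '{if (!seen[$1]++) u++} END{print u}'   # prints 2 (x and y)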
At the end (END{}), it multiplies field18ok by the number of unique compounds in column 4. So, if field 18 is not one that you want, the answer will be zero, whereas if field 18 is one of the values you are looking for, it will be the number of unique values in field 4. The output is then sorted numerically so you can pick the highest value easily.
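If you only want the single best file printed, changing the last line of the script to done | sort -n | tail -1 keeps just the file with the most unique compounds (my addition, not in the original answer).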
Solution 2:
This code reads each file's data, takes the value at the 18th position (index 17, because indexing is zero-based) and, if it matches one of the wanted values, adds the compound to a dict keyed by filename. I've used a set because that structure doesn't store duplicate values. Finally, you only have to check each set to see which file has the most unique values.
import csv

files = ['Test_95_target_1334_assay_Detail3.csv',
         'Test_95_target_1338_assay_Detail3.csv',
         'Test_95_target_2888_assay_Detail3.csv']

pos_to_check = 17   # zero-based index of column 18
pos_compound = 3    # zero-based index of column 4
values_to_check = ["enzymatic", "biochemical", "cell"]

result = dict((file, set()) for file in files)   # file : set of compounds

for file in files:
    with open(file, 'r') as csvfile:
        csvreader = csv.reader(csvfile)
        for row in csvreader:
            if row[pos_to_check].lower() in values_to_check:
                result[file].add(row[pos_compound])

# get the key (file) whose set has the most elements
print(max(result, key=lambda key: len(result[key])))
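Note that max returns only the winning filename; if you also want the counts per file, something like sorted(result.items(), key=lambda kv: len(kv[1])) would give all files in ascending order of unique compounds (my suggestion, not part of the original answer).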