Pandas DataFrame [cell=(label,value)], Split Into 2 Separate Dataframes
I found an awesome way to parse html with pandas. My data is in kind of a weird format (attached below). I want to split this data into 2 separate dataframes. Notice how each c
Solution 1:
you can do it this way:
DF1:
In [182]: df1 = DF_freqSpecies.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)
In [183]: df1.head()
Out[183]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska \
0 Bluehead Copper Rockfish
1 Blue Tang Lingcod
2 Stoplight Parrotfish Painted Greenling
3 Bicolor Damselfish Sunflower Star
4 French Grunt Plumose Anemone
0 Hawaii Tropical Eastern Pacific \
0 Saddle Wrasse King Angelfish
1 Hawaiian Whitespotted Toby Mexican Hogfish
2 Raccoon Butterflyfish Barberfish
3 Manybar Goatfish Flag Cabrilla
4 Moorish Idol Panamic Sergeant Major
0 South Pacific Northeast US and Eastern Canada \
0 Regal Angelfish Cunner
1 Bluestreak Cleaner Wrasse Winter Flounder
2 Manybar Goatfish Rock Gunnel
3 Brushtail Tang Pollock
4 Two-spined Angelfish Grubby Sculpin
0 South Atlantic States Central Indo-Pacific
0 Slippery Dick Moorish Idol
1 Belted Sandfish Three-spot Dascyllus
2 Black Sea Bass Bluestreak Cleaner Wrasse
3 Tomtate Blacklip Butterflyfish
4 Cubbyu Clark's Anemonefish
and DF2
In [193]: df2 = DF_freqSpecies.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True)
In [194]: df2.head()
Out[194]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska Hawaii \
0 85 54.6 92
1 84.8 53.2 85.8
2 81 50.8 85.7
3 79.9 50.2 85.7
4 74.8 49.7 82.9
0 Tropical Eastern Pacific South Pacific Northeast US and Eastern Canada \
0 85.7 79 67.4
1 82.5 77.3 46.6
2 75.2 73.9 26.2
3 68.9 73.3 25.2
4 67.9 72.8 23.7
0 South Atlantic States Central Indo-Pacific
0 79.7 80.1
1 78.5 75.6
2 78.5 73.5
3 72.7 71.4
4 65.7 70.2
RegEx debugging and explanation:
we basically want to remove everything, except number in parentheses:
(\d+\.*\d*)
- group(1) - it's our number
\((\d+\.*\d*)\)
- our number in parentheses
.*\((\d+\.*\d*)\).*
- the whole thing - anything before '(', '(', our number, ')', anything till the end of the cell
it will be replaced with the group(1) - our number
Post a Comment for "Pandas DataFrame [cell=(label,value)], Split Into 2 Separate Dataframes"