Strange Behaviour. A Single Negative Value Generated In Pandas Dataframe, When New Column Is Created
Following is the head of the Pandas dataframe that I am working on.
test.head()
Country_or_Other TotalCases TotalTests
9 USA 2026493 21725064
10 Br
Solution 1:
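Here pandas stores the column as int32 instead of int64, so multiplying by 100 overflows and the output is wrong. The cause is astype(int), which uses the platform's default integer; that is 32-bit on some setups (for example Windows with NumPy versions before 2.0):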
df['TotalTests'] = df['TotalTests'].fillna(0).astype(int)
The column is then int32, and multiplying it by 100 gives a negative value:
test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]
print (test.dtypes)
Country_or_Other object
TotalCases int64
TotalTests int32
dtype: object
test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']
df = test[test['Country_or_Other'] == 'USA']
print (df)
Country_or_Other TotalCases TotalTests TotalCases_Percent \
9 USA 2026493 21725064 28.18313
TotalTests_Percent
9 -2122460896
The solution is to convert to np.int64:
df['TotalDeaths'] = df['TotalDeaths'].fillna(0).astype(np.int64)
df['TotalRecovered'] = df['TotalRecovered'].fillna(0).astype(np.int64)
df['TotalTests'] = df['TotalTests'].fillna(0).astype(np.int64)
test = cum_data.loc[:,['Country_or_Other', 'TotalCases','TotalTests']]
print (test.dtypes)
Country_or_Other object
TotalCases int64
TotalTests int64
dtype: object
test['TotalCases_Percent'] = 100*test['TotalCases']/test['TotalCases'].sum()
test['TotalTests_Percent'] = 100*test['TotalTests']/test['TotalTests'].sum()
df = test[test['Country_or_Other'] == 'USA']
print (df)
Country_or_Other TotalCases TotalTests TotalCases_Percent \
9 USA 2026493 21725064 28.18313
TotalTests_Percent
9 21.641801
Solution 2:
Solution 1:
Change the TotalTests column dtype to int64.
Use:
df['TotalTests'] = df['TotalTests'].fillna(0).astype('int64') ## int64
instead of:
df['TotalTests'] = df['TotalTests'].fillna(0).astype(int) ## int32
Why?
Mathematically, after multiplying the TotalTests value for USA by 100, the result should be 2 172 506 400, which is larger than the int32 maximum value of 2 147 483 647. The result therefore overflows and wraps around to a negative number. Changing the type to int64 provides a much higher maximum value (about 9.22 × 10^18).
In general, prefer int64 for any column whose values, or intermediate results computed from them, may approach the int32 maximum.
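The wraparound described above can be checked with plain Python arithmetic. This is only a sketch of the arithmetic, not of what pandas does internally: it uses the standard-library ctypes.c_int32 to reinterpret the exact product as a signed 32-bit integer, and the variable names are chosen here for illustration.

```python
import ctypes

INT32_MAX = 2**31 - 1            # 2 147 483 647
total_tests = 21725064           # USA value from the question

# Python integers have arbitrary precision, so this product is exact.
exact = total_tests * 100        # 2 172 506 400

# Keep only the low 32 bits and read them as a signed integer, which is
# the effect of a fixed-width int32 multiplication that overflows.
wrapped = ctypes.c_int32(exact).value

print(exact > INT32_MAX)         # True: the product does not fit in int32
print(wrapped)                   # -2122460896, the value seen in the output
print(wrapped == exact - 2**32)  # True: the result is off by exactly 2**32
```

The negative number in the dataframe is thus the true product minus 2**32, which is why only the one row whose product exceeds the int32 maximum is affected.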
Solution 2 (naive):
Multiply by 100 after division to avoid exceeding the maximum value at any point:
test['TotalTests_Percent'] = (test['TotalTests']/test['TotalTests'].sum())*100
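As a rough comparison of the two fixes, the sketch below builds a small frame with an explicitly int32 column and computes the percentage both ways. The USA figure comes from the question; the second row ("Example", 1000000) is made up purely for illustration, and the frame is constructed directly rather than loaded from the original cum_data.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; only the USA value is from the question.
test = pd.DataFrame({
    "Country_or_Other": ["USA", "Example"],
    "TotalTests": np.array([21725064, 1000000], dtype=np.int32),
})
print(test["TotalTests"].dtype)  # int32

# Fix 1: widen the column to 64-bit before multiplying.
wide = test["TotalTests"].astype(np.int64)
pct_widened = 100 * wide / wide.sum()

# Fix 2: divide first. The division produces floats, so the later
# multiplication by 100 never happens in 32-bit integer arithmetic.
pct_divide_first = test["TotalTests"] / test["TotalTests"].sum() * 100

# Both approaches should agree and add up to 100 percent.
print(np.allclose(pct_widened, pct_divide_first))
print(pct_divide_first.sum())
```

Dividing first is enough for a percentage like this one, but widening to int64 also protects any later integer arithmetic on the column.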