Clean One Column From Long And Big Data Set
I am trying to clean just one column from a long and large data set. The data has 18 columns and more than 10k rows across hundreds of CSV files, of which I want to clean only one column.
Solution 1:
There are many things wrong here.
- The file is not a simple CSV and is not being appropriately parsed by your assumed data = pd.read_csv('input.csv').
- The 'Coordinates' field seems to be a JSON string.
- There are NaNs in that same field.
This is what I've done so far. You'll want to do some work on your own to parse this file more appropriately.
import ast
import pandas as pd

df1 = pd.read_csv('./Turkey_28.csv')

# Pull out just the column to clean, keyed by tweetID
coords = df1[['tweetID', 'Coordinates']].set_index('tweetID')['Coordinates']

# Drop missing values, then parse the stringified dicts;
# ast.literal_eval is safer than eval for data you didn't write
coords = coords.dropna().apply(ast.literal_eval)
coords = coords[coords.apply(type) == dict]

def get_coords(x):
    return pd.Series(x['coordinates'], index=['Coordinate_one', 'Coordinate_two'])

coords = coords.apply(get_coords)
df2 = pd.concat([coords, df1.set_index('tweetID').reindex(coords.index)], axis=1)
print(df2.head(2).T)
tweetID 714602054988275712
Coordinate_one 23.2745
Coordinate_two 56.6165
tweetText I'm at MK Appartaments in Dobele https://t.co/...
tweetRetweetCt 0
tweetFavoriteCt 0
tweetSource Foursquare
tweetCreated 2016-03-28 23:56:21
userID 782541481
userScreen MartinsKnops
userName Martins Knops
userCreateDt 2012-08-26 14:24:29
userDesc I See Them Try But They Can't Do What I Do. Be...
userFollowerCt 137
userFriendsCt 164
userLocation DOB Till I Die
userTimezone Casablanca
Coordinates {u'type': u'Point', u'coordinates': [23.274462...
GeoEnabled True
Language en
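Since the question mentions hundreds of CSV files, the single-file logic above can be wrapped in a function and applied across all of them. Here is a minimal sketch, assuming every file shares the Turkey_28.csv layout; the Turkey_*.csv glob pattern is only a guess at how the files are named:

```python
import ast
import glob

import pandas as pd

def clean_coordinates(path):
    """Parse the stringified 'Coordinates' column of one CSV file.

    Assumes the same layout as Turkey_28.csv: a 'tweetID' key column
    and a 'Coordinates' column holding dict-repr strings, with NaNs.
    """
    df = pd.read_csv(path)
    coords = df.set_index('tweetID')['Coordinates'].dropna()
    coords = coords.apply(ast.literal_eval)       # safer than eval
    coords = coords[coords.apply(type) == dict]   # keep only real dicts
    return coords.apply(lambda d: pd.Series(
        d['coordinates'], index=['Coordinate_one', 'Coordinate_two']))

# The glob pattern is hypothetical -- adjust it to wherever the files live
frames = [clean_coordinates(p) for p in glob.glob('./Turkey_*.csv')]
```

Each entry in frames is a small two-column frame indexed by tweetID, so they can be concatenated or written back out per file as needed.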
Solution 2:
10K rows doesn't look at all like Big Data. How many columns do you have?
I don't understand your code; it is broken. But here is an easy example manipulation:
df = pd.read_csv('input.csv')
df['tweetID'] = df['tweetID'] + 1  # add 1 to every value in one column
df.to_csv('output.csv', index=False)
If your data doesn't fit into memory you might consider using Dask.
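Before reaching for Dask, pandas itself can process a file that doesn't fit in memory by streaming it in pieces with read_csv's chunksize parameter. A sketch using the same toy tweetID edit as above; the filenames and chunk size are placeholders:

```python
import pandas as pd

def bump_ids_in_chunks(src, dst, chunksize=10_000):
    # Stream the CSV in fixed-size pieces instead of loading it whole,
    # so only one chunk is held in memory at a time
    with open(dst, 'w', newline='') as out:
        for i, chunk in enumerate(pd.read_csv(src, chunksize=chunksize)):
            chunk['tweetID'] = chunk['tweetID'] + 1  # same toy edit
            # Write the header only once, for the first chunk
            chunk.to_csv(out, index=False, header=(i == 0))
```

Dask's dask.dataframe offers a similar out-of-core model with a pandas-like API, plus parallelism across the chunks.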