Insert Multiple Documents In Elasticsearch - Bulk Doc Formatter
TLDR; How can I bulk format my JSON file for ingestion to Elasticsearch? I am attempting to ingest some NOAA data into Elasticsearch and have been utilizing NOAA Python SDK. I hav
Solution 1:
How about utilizing the bulk
method of the official python client?
import json
from noaa_sdk import noaa
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
noaa_client = noaa.NOAA()
alerts = noaa_client.alerts()['features']
es = Elasticsearch()
defsave_alerts():
withopen('nhc_alerts.json', 'w') as f:
f.write(json.dumps(alerts))
defbulk_sync():
actions = [
{
"_index": "my_noaa_index",
"_source": alert
} for alert in alerts
]
bulk(es, actions)
save_alerts()
bulk_sync()
Solution 2:
The problem is that the alerts
JSON dump is all on a single line, so it's not gonna work as it is. You need to iterate on all the alerts (I suspect whatever is in the alerts.features
array) and do it all in one shot without going through an intermediary file, like this:
n = noaa.NOAA()
alerts = n.alerts()
f = open('nhc_alerts.json', 'w')
for alert in alerts['features']:
f.write('%s\n' % json.dumps({'index': {}}));
f.write('%s\n' % json.dumps(alert, indent=0).replace('\n', ''))
f.write('\n')
Solution 3:
I suspect this line to result in an error later on json.dumps(json_in.read())
. json.dumps
returns a string. When you iterate over a string, like you do in the next line, then you iterate over the chars.
I think what you actually want is the following. It saves every feature
of alert["features“]
as a new line in json format.
from noaa_sdk import noaa
import json
from pathlib import Path
noaa_client = noaa.NOAA()
alerts = noaa_client.alerts()
save_path = Path('.') / "alert.json"with save_path.open("a") as f:
for feature in alerts["features"]:
json.dump(feature, f)
f.write("\n")
Result:
{"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-16211", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-KEEPALIVE-16211", "@type": "wx:Alert", "id": "NWS-IDP-PROD-KEEPALIVE-16211", "areaDesc": "Montgomery", "geocode": {"UGC": ["MDC031"], "SAME": ["024031"]}, "affectedZones": ["https://api.weather.gov/zones/county/MDC031"], "references": [], "sent": "2020-05-06T16:55:56+00:00", "effective": "2020-05-06T16:55:56+00:00", "onset": null, "expires": "2020-05-06T17:05:56+00:00", "ends": null, "status": "Test", "messageType": "Alert", "category": "Met", "severity": "Unknown", "certainty": "Unknown", "urgency": "Unknown", "event": "Test Message", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS", "headline": null, "description": "Monitoring message only. Please disregard.", "instruction": "Monitoring message only. Please disregard.", "response": "None", "parameters": {"PIL": ["NWSKEPWBC"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"]}}}
{"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197938-3548807", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197938-3548807", "@type": "wx:Alert", "id": "NWS-IDP-PROD-4197938-3548807", "areaDesc": "Coastal waters from NC VA border to Currituck Beach Light NC out 20 nm; Coastal Waters from Cape Charles Light to Virginia-North Carolina border out to 20 nm", "geocode": {"UGC": ["ANZ658", "ANZ656"], "SAME": ["073658", "073656"]}, "affectedZones": ["https://api.weather.gov/zones/forecast/ANZ658", "https://api.weather.gov/zones/forecast/ANZ656"], "references": [{"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197751-3548667", "identifier": "NWS-IDP-PROD-4197751-3548667", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T09:51:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197640-3548624", "identifier": "NWS-IDP-PROD-4197640-3548624", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T06:35:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197422-3548452", "identifier": "NWS-IDP-PROD-4197422-3548452", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T03:25:00-04:00"}], "sent": "2020-05-06T12:54:00-04:00", "effective": "2020-05-06T12:54:00-04:00", "onset": "2020-05-07T04:00:00-04:00", "expires": "2020-05-06T21:00:00-04:00", "ends": "2020-05-07T13:00:00-04:00", "status": "Actual", "messageType": "Update", "category": "Met", "severity": "Minor", "certainty": "Likely", "urgency": "Expected", "event": "Small Craft Advisory", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS Wakefield VA", "headline": "Small Craft Advisory issued May 6 at 12:54PM EDT until May 7 at 1:00PM EDT by NWS Wakefield VA", "description": "* WHAT...Northwest winds 15 to 20 kt with gusts up to 25 kt and\nseas 3 to 5 ft expected.\n\n* WHERE...Coastal Waters from Cape Charles Light to Virginia-\nNorth Carolina border out to 20 nm and Coastal waters from NC\nVA border to Currituck Beach Light NC out 20 nm.\n\n* WHEN...From 4 AM to 1 PM EDT Thursday.\n\n* IMPACTS...Conditions will be hazardous to small craft.", "instruction": "Inexperienced mariners, especially those operating smaller\nvessels, should avoid navigating in hazardous conditions.", "response": "Avoid", "parameters": {"NWSheadline": ["SMALL CRAFT ADVISORY REMAINS IN EFFECT FROM 4 AM TO 1 PM EDT THURSDAY"], "VTEC": ["/O.CON.KAKQ.SC.Y.0054.200507T0800Z-200507T1700Z/"], "PIL": ["AKQMWWAKQ"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"], "eventEndingTime": ["2020-05-07T13:00:00-04:00"]}}}
{"id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197936-3548805", "type": "Feature", "geometry": null, "properties": {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197936-3548805", "@type": "wx:Alert", "id": "NWS-IDP-PROD-4197936-3548805", "areaDesc": "Chesapeake Bay from Smith Point to Windmill Point VA; Chesapeake Bay from New Point Comfort to Little Creek VA; Chesapeake Bay from Windmill Point to New Point Comfort VA; Chesapeake Bay from Little Creek VA to Cape Henry VA including the Chesapeake Bay Bridge Tunnel", "geocode": {"UGC": ["ANZ630", "ANZ632", "ANZ631", "ANZ634"], "SAME": ["073630", "073632", "073631", "073634"]}, "affectedZones": ["https://api.weather.gov/zones/forecast/ANZ630", "https://api.weather.gov/zones/forecast/ANZ632", "https://api.weather.gov/zones/forecast/ANZ631", "https://api.weather.gov/zones/forecast/ANZ634"], "references": [{"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197423-3548453", "identifier": "NWS-IDP-PROD-4197423-3548453", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T03:25:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197750-3548666", "identifier": "NWS-IDP-PROD-4197750-3548666", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T09:51:00-04:00"}, {"@id": "https://api.weather.gov/alerts/NWS-IDP-PROD-4197641-3548625", "identifier": "NWS-IDP-PROD-4197641-3548625", "sender": "w-nws.webmaster@noaa.gov", "sent": "2020-05-06T06:35:00-04:00"}], "sent": "2020-05-06T12:54:00-04:00", "effective": "2020-05-06T12:54:00-04:00", "onset": "2020-05-06T22:00:00-04:00", "expires": "2020-05-06T21:00:00-04:00", "ends": "2020-05-07T13:00:00-04:00", "status": "Actual", "messageType": "Update", "category": "Met", "severity": "Minor", "certainty": "Likely", "urgency": "Expected", "event": "Small Craft Advisory", "sender": "w-nws.webmaster@noaa.gov", "senderName": "NWS Wakefield VA", "headline": "Small Craft Advisory issued May 6 at 12:54PM EDT until May 7 at 1:00PM EDT by NWS Wakefield VA", "description": "* WHAT...North winds 10 to 20 kt with gusts up to 25 kt and\nwaves 2 to 3 ft expected.\n\n* WHERE...Chesapeake Bay from Little Creek VA to Cape Henry VA\nincluding the Chesapeake Bay Bridge Tunnel, Chesapeake Bay\nfrom New Point Comfort to Little Creek VA, Chesapeake Bay from\nSmith Point to Windmill Point VA and Chesapeake Bay from\nWindmill Point to New Point Comfort VA.\n\n* WHEN...From 10 PM this evening to 1 PM EDT Thursday.\n\n* IMPACTS...Conditions will be hazardous to small craft.", "instruction": "Inexperienced mariners, especially those operating smaller\nvessels, should avoid navigating in hazardous conditions.", "response": "Avoid", "parameters": {"NWSheadline": ["SMALL CRAFT ADVISORY REMAINS IN EFFECT FROM 10 PM THIS EVENING TO 1 PM EDT THURSDAY"], "VTEC": ["/O.CON.KAKQ.SC.Y.0054.200507T0200Z-200507T1700Z/"], "PIL": ["AKQMWWAKQ"], "BLOCKCHANNEL": ["CMAS", "EAS", "NWEM"], "eventEndingTime": ["2020-05-07T13:00:00-04:00"]}}}
...
Post a Comment for "Insert Multiple Documents In Elasticsearch - Bulk Doc Formatter"