Flatten Nested Json (dict, List) Into List To Prepare To Write Into Db
Solution 1:
Here is my solution for splitObj
defsplitObj (obj, prefix = None):
'''
Split the object, returning a 3-tuple with the flat object, optionally
followed by the key for the subobjects and a list of those subobjects.
obj needs to be a Dictonary
'''# copy the object, optionally add the prefix before each key
new = obj.copy() if prefix isNoneor prefix=="NotFlat"else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }
cL = 0
cD = 0# try to find the key holding the subobject or a list of subobjectsfor k, v in new.items():
#Determine the number of lists in vifisinstance(v, list):
cL += 1#Determine the number of dict in velifisinstance(v, dict):
cD += 1for k, v in new.items():
# list of subobjectsifisinstance(v, list):
if (cD+cL) <=1:
try:
type(v[0])
except IndexError:
v = [""]
ifnotisinstance(v[0], str):
del new[k]
return new, k, v
elifisinstance(v[0], str):
#handle list when only containing strings, return, the whole thing#solve other dicts which might be in the line#use "NotFlat" to run loop again but without adding a prefix
new[k] = ", ".join(v)
return new, None, Noneelse:
custLog.logger.info("")
elif (cD+cL) >1:
#print("Count List2 CD: "+str(cD))#print("Count LIST2 CL: "+str(cL))#if list is emptytry:
type(v[0])
except IndexError:
v = [""]
ifnotisinstance(v[0], str):
del new[k]
for x in flatten([new]):
newOut = x
breakreturn newOut, k, v
elifisinstance(v[0], str):
#handle list when only containing strings, return, the whole thing#solve other dicts which might be in the line#use "NotFlat" to run loop again but without adding a prefix
new[k] = ", ".join(v)
returnNone, "NotFlat", [new]
else:
custLog.logger.error("weder noch 2")
# or just one subobjectelifisinstance(v, dict):
if (cD+cL) <=1:
del new[k]
return new, k, [v]
elif (cD+cL) >1:
del new[k]
for x in flatten([new]):
newOut = x
breakreturn newOut, k, [v]
return new, None, None
and here for flatten
defflatten (data, prefix = None):
'''
Flatten the data, optionally with each key prefixed.
'''# iterate all itemsfor item in data:
# split the object
flat, key, subs = splitObj(item, prefix)
if subs isNone:
if key isNone:
yield flat
continue# just return fully flat objectsif key isNoneand flat isnotNone:
yield flat
continue# otherwise recursively flatten the subobjectstry:
for sub in flatten(subs, key):
if flat isnotNone:
sub.update(flat)
yield sub
except TypeError as e:
custLog.logger.error("ERR: TypeError"+str(e))
Solution 2:
While not a generalizable function, consider walking through each nested element for a flat output for database import or flatfile (csv, txt) export. Since json files consist of combinations of dictionaries and lists, handle them accordingly at each level:
items = []
forouterin data:
inner= [''] *15for outerk, outerv in outer.items():
inner[0] =outer['states'][0]
inner[1] =outer['niceName']
inner[2] =outer['id']
inner[3] =outer['make']['niceName']
inner[4] =outer['make']['name']
inner[5] =outer['make']['id']
if outerk =='years':
for yri inouter[outerk]:
for yrk, yrv in yri.items():
inner[6] = yri['states'][0]
inner[7] = yri['id']
inner[8] = yri['year']
if yrk =='styles':
for stylei in yri[yrk]:
inner[9] = stylei['trim']
inner[10] = stylei['name']
inner[11] = stylei['id']
inner[12] = stylei['submodel']['body']
inner[13] = stylei['submodel']['niceName']
inner[14] = stylei['submodel']['modelName']
items.append(inner[0:14])
for i in items:
print(i)
Output(where parent items repeat for each child)
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '128i', '128i 2dr Convertible (3.0L 6cyl 6M)', 100994560, 'Convertible', 'convertible']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '128i', '128i 2dr Coupe (3.0L 6cyl 6M)', 100974974, 'Coupe', 'coupe']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '135i', '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 100974975, 'Coupe', 'coupe']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '135i', '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 100994561, 'Convertible', 'convertible']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100503222, 2009, '135i', '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 101082656, 'Coupe', 'coupe']
Solution 3:
Rethinking the problem
It is often easier to find a solution for a more general problem. So, let's look closer at the problem first.
The input is some JSON file describing a set of objects.
An object is recusively defined as either an atom (strings or numbers) or a dict with object values. Lists are used to represent alternatives (i.e. any element of the list can take the place of the list).
For example {a:[1,2]}
means that a
can be either 1
or 2
.
The output should be a list of objects that do not contain any options. Additionally, the objects should be flattened, i.e. should be be dicts whose values are atoms and whose keys are describing the path to the value in the original object.
My solution handles alternatives and flattening separately.
Normalising
The function normalise
below takes the input of json.dumps
and yields a sequence of dicts. Note that the input and output of normalise
have the same semantics and describe the same set of objects. The output is just normalised in the sense that it does contain alternatives only at the top level. Database people would call that denormalised because it is undesirable for relational databases.
normalise
always returns a sequence of objects. normalise
is implemented as generator to keep memory usage low.
The following cases are distinguished in normalise
.
- An atom input means there is only one possibility. So, the atom is yielded (this is like returning a list containing the atom).
- A list means alternatives of alternatives. It yields all elements of its normalised input (this is like concatenating lists).
- A dict means that we have to consider all combinations of alternatives for the individual keys. So, we return a cartesian product of the alternatives.
Here is the code:
import itertools
defnormalise(x):
ifisinstance(x, dict):
keys = x.keys()
values = (normalise(i) for i in x.values())
for i in itertools.product(*values):
yield (dict(zip(keys, i)))
elifisinstance(x, list):
#if not x: # uncomment for "LEFT JOIN" behaviour# yield Nonefor i in x:
yieldfrom normalise(i)
else:
yield x
This code does not return an object if it contains any empty lists. This is because there is no possible value. This is like SQL "INNER JOIN". From Bert's answer it looks like he wants "LEFT JOIN" behaviour (i.e. some default value). To achieve that just uncomment the two lines.
Pseudo-flattening
The objects yielded by normalise
still have the original (nested) dict structure. They could be flattened using code found in other discussions.
However, the OP wants to insert the objects in a database. So, he will most likely not need a list of keys of the flattened dictionary. He will only need a function returning the value for a given path.
This can be achieved by creating a wrapper object to the dict that has a __getitem__
method. This wrapper can also be used to return a default value for non-existing paths.
classDictWrapper:
def__init__(self, d, sep='.'):
self.d = d
self.sep = sep
def__getitem__(self, key):
ret = self.d
try:
for k in key.split(self.sep):
ret = ret[k]
return ret
except KeyError:
returnNone
The sql insert can than look like the following (tested with psycopg2)
for i in normalise(input):
cur.execute('insert into mytable (year) VALUES (%(years.year)s)', DictWrapper(i))
Implementation details
This implemenation clearly sacrifices some runtime performance for clarity.
Abstract base classes could be used instead of
list
anddict
. However, this might be problematic becausestr
is a sequence but should be treated as atom.DictWrapper
only works properly ifsep
is not contained in any of the dict keys.normalise
does not filter out duplicates. This could be done by using sets and named tuples instead of lists and dicts. However, this would imply that the whole result has to be in memory. It might be better to filter out duplicates at the database level.To keep memory usage at a minimum the JSON should be read lazily.
Post a Comment for "Flatten Nested Json (dict, List) Into List To Prepare To Write Into Db"