Flatten Nested JSON (Dict, List) Into a List to Prepare for Writing Into a DB

February 17, 2024

I am still working on flattening a nested JSON file. The nested items are either lists or dicts. Here is the file I want to flatten (unlike in my previous post, I kept it a…

Solution 1:

Here is my solution for splitObj:

def splitObj(obj, prefix=None):
    '''
    Split the object, returning a 3-tuple with the flat object, optionally
    followed by the key for the subobjects and a list of those subobjects.
    obj needs to be a dictionary.
    '''
    # custLog is not defined in this post; presumably it is the author's own logging helper.
    # copy the object, optionally add the prefix before each key
    new = obj.copy() if prefix is None or prefix == "NotFlat" else {'{}_{}'.format(prefix, k): v for k, v in obj.items()}

    cL = 0
    cD = 0
    # count the values that are lists or dicts
    for k, v in new.items():
        # determine the number of lists in v
        if isinstance(v, list):
            cL += 1
        # determine the number of dicts in v
        elif isinstance(v, dict):
            cD += 1

    # try to find the key holding the subobject or a list of subobjects
    for k, v in new.items():
        # list of subobjects
        if isinstance(v, list):
            if (cD + cL) <= 1:
                # if the list is empty, treat it as a list with one empty string
                try:
                    type(v[0])
                except IndexError:
                    v = [""]
                if not isinstance(v[0], str):
                    del new[k]
                    return new, k, v
                elif isinstance(v[0], str):
                    # handle a list containing only strings: join it and return the whole thing
                    new[k] = ", ".join(v)
                    return new, None, None
                else:
                    custLog.logger.info("")
            elif (cD + cL) > 1:
                # if the list is empty, treat it as a list with one empty string
                try:
                    type(v[0])
                except IndexError:
                    v = [""]

                if not isinstance(v[0], str):
                    del new[k]
                    # flatten the remaining nested values of this object first
                    for x in flatten([new]):
                        newOut = x
                        break
                    return newOut, k, v
                elif isinstance(v[0], str):
                    # handle a list containing only strings: join it, then
                    # resolve the other dicts/lists that might be in the object;
                    # "NotFlat" runs the loop again without adding a prefix
                    new[k] = ", ".join(v)
                    return None, "NotFlat", [new]
                else:
                    custLog.logger.error("neither nor 2")

        # or just one subobject
        elif isinstance(v, dict):
            if (cD + cL) <= 1:
                del new[k]
                return new, k, [v]
            elif (cD + cL) > 1:
                del new[k]
                for x in flatten([new]):
                    newOut = x
                    break
                return newOut, k, [v]
    return new, None, None

And here is flatten:

def flatten(data, prefix=None):
    '''
    Flatten the data, optionally with each key prefixed.
    '''
    # iterate all items
    for item in data:
        # split the object
        flat, key, subs = splitObj(item, prefix)
        if subs is None:
            if key is None:
                yield flat
                continue
        # just return fully flat objects
        if key is None and flat is not None:
            yield flat
            continue
        # otherwise recursively flatten the subobjects
        try:
            for sub in flatten(subs, key):
                if flat is not None:
                    sub.update(flat)
                yield sub
        except TypeError as e:
            custLog.logger.error("ERR: TypeError" + str(e))

Solution 2:

While not a generalizable function, consider walking through each nested element for a flat output for database import or flatfile (csv, txt) export. Since json files consist of combinations of dictionaries and lists, handle them accordingly at each level:

items = []
for outer in data:
    inner = [''] * 15
    for outerk, outerv in outer.items():
        # values from the outer (parent) level
        inner[0] = outer['states'][0]
        inner[1] = outer['niceName']
        inner[2] = outer['id']
        inner[3] = outer['make']['niceName']
        inner[4] = outer['make']['name']
        inner[5] = outer['make']['id']
        if outerk == 'years':
            for yri in outer[outerk]:
                for yrk, yrv in yri.items():
                    # values from the year level
                    inner[6] = yri['states'][0]
                    inner[7] = yri['id']
                    inner[8] = yri['year']
                    if yrk == 'styles':
                        for stylei in yri[yrk]:
                            # values from the style level
                            inner[9] = stylei['trim']
                            inner[10] = stylei['name']
                            inner[11] = stylei['id']
                            inner[12] = stylei['submodel']['body']
                            inner[13] = stylei['submodel']['niceName']
                            inner[14] = stylei['submodel']['modelName']

                            # note: the slice 0:14 omits inner[14] (modelName),
                            # which matches the 14-value rows shown in the output
                            items.append(inner[0:14])

for i in items:
    print(i)

Output (where parent items repeat for each child):

# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '128i', '128i 2dr Convertible (3.0L 6cyl 6M)', 100994560, 'Convertible', 'convertible']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '128i', '128i 2dr Coupe (3.0L 6cyl 6M)', 100974974, 'Coupe', 'coupe']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '135i', '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 100974975, 'Coupe', 'coupe']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '135i', '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 100994561, 'Convertible', 'convertible']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100503222, 2009, '135i', '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 101082656, 'Coupe', 'coupe']
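
For the flat-file export mentioned above, rows like these can be written out with the standard csv module. This is only a minimal sketch: the column names in header are hypothetical labels picked for illustration, and just two of the rows from the output are used as sample data.

```python
import csv
import io

# Hypothetical column names for the 14 values in each row (not from the original post).
header = [
    "model_state", "model_nice_name", "model_id",
    "make_nice_name", "make_name", "make_id",
    "year_state", "year_id", "year",
    "trim", "style_name", "style_id",
    "body", "submodel_nice_name",
]

# Two sample rows copied from the output above.
items = [
    ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008,
     '128i', '128i 2dr Convertible (3.0L 6cyl 6M)', 100994560, 'Convertible', 'convertible'],
    ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008,
     '128i', '128i 2dr Coupe (3.0L 6cyl 6M)', 100974974, 'Coupe', 'coupe'],
]

def rows_to_csv(header, rows):
    """Return the rows as CSV text, with a header line first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(header, items)
print(csv_text)
```

Writing to an in-memory buffer keeps the example free of file paths; passing an open file object to csv.writer works the same way.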

Solution 3:

Rethinking the problem

It is often easier to find a solution for a more general problem. So, let's look closer at the problem first.

The input is some JSON file describing a set of objects.

An object is recursively defined as either an atom (a string or a number) or a dict whose values are objects. Lists are used to represent alternatives (i.e. any element of the list can take the place of the list). For example, {a:[1,2]} means that a can be either 1 or 2.

The output should be a list of objects that do not contain any alternatives. Additionally, the objects should be flattened, i.e. they should be dicts whose values are atoms and whose keys describe the path to the value in the original object.

My solution handles alternatives and flattening separately.

Normalising

The function normalise below takes a parsed JSON value (the kind of Python object that json.load returns and that json.dumps accepts) and yields a sequence of dicts. Note that the input and output of normalise have the same semantics and describe the same set of objects. The output is just normalised in the sense that it contains alternatives only at the top level. Database people would call that denormalised, because it is undesirable for relational databases.

normalise always returns a sequence of objects. It is implemented as a generator to keep memory usage low.

The following cases are distinguished in normalise.

An atom input means there is only one possibility. So, the atom is yielded (this is like returning a list containing the atom).
A list means alternatives of alternatives. It yields all elements of its normalised input (this is like concatenating lists).
A dict means that we have to consider all combinations of alternatives for the individual keys. So, we return a cartesian product of the alternatives.

Here is the code:

import itertools

def normalise(x):
    if isinstance(x, dict):
        keys = x.keys()
        values = (normalise(i) for i in x.values())
        for i in itertools.product(*values):
            yield (dict(zip(keys, i)))
    elif isinstance(x, list):
        #if not x:           # uncomment for "LEFT JOIN" behaviour
        #    yield None
        for i in x:
            yield from normalise(i)
    else:
        yield x

This code does not return an object if it contains any empty lists. This is because there is no possible value. This is like SQL "INNER JOIN". From Bert's answer it looks like he wants "LEFT JOIN" behaviour (i.e. some default value). To achieve that just uncomment the two lines.
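
To make the two behaviours concrete, here is a small self-contained sketch. It repeats normalise with a keep_empty flag standing in for the two commented-out lines; that flag and the sample data are assumptions added for illustration, not part of the original answer.

```python
import itertools

def normalise(x, keep_empty=False):
    """Yield every alternative-free object described by x.

    keep_empty is a hypothetical flag standing in for the two commented-out
    lines: when True, an empty list yields None ("LEFT JOIN" behaviour).
    """
    if isinstance(x, dict):
        keys = list(x.keys())
        values = (normalise(v, keep_empty) for v in x.values())
        # Cartesian product of the alternatives for each key.
        for combo in itertools.product(*values):
            yield dict(zip(keys, combo))
    elif isinstance(x, list):
        if keep_empty and not x:
            yield None
        for item in x:
            yield from normalise(item, keep_empty)
    else:
        yield x

# Made-up sample data: one key with two alternatives, one with an empty list.
sample = {"make": "BMW", "year": [2008, 2009]}
with_empty = {"make": "BMW", "year": []}

print(list(normalise(sample)))
# [{'make': 'BMW', 'year': 2008}, {'make': 'BMW', 'year': 2009}]
print(list(normalise(with_empty)))
# []  ("INNER JOIN": the object disappears)
print(list(normalise(with_empty, keep_empty=True)))
# [{'make': 'BMW', 'year': None}]  ("LEFT JOIN": a default value is kept)
```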

Pseudo-flattening

The objects yielded by normalise still have the original (nested) dict structure. They could be flattened using code found in other discussions.
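
One common way to do that flattening is a short recursive function that joins nested keys into a path. The sketch below is one such variant, not the code from any particular discussion; the name flatten_paths, the '.' separator, and the sample row are assumptions.

```python
def flatten_paths(obj, parent="", sep="."):
    """Return a flat dict mapping 'a.b.c'-style paths to atom values.

    Assumes obj contains only dicts and atoms, as the objects yielded by
    normalise do (lists have already been expanded away).
    """
    flat = {}
    for key, value in obj.items():
        path = f"{parent}{sep}{key}" if parent else str(key)
        if isinstance(value, dict):
            # Recurse into the nested dict, extending the path.
            flat.update(flatten_paths(value, path, sep))
        else:
            flat[path] = value
    return flat

# Made-up sample row in the shape normalise would yield.
row = {"id": "BMW_1_Series", "make": {"name": "BMW"}, "years": {"year": 2008}}
print(flatten_paths(row))
# {'id': 'BMW_1_Series', 'make.name': 'BMW', 'years.year': 2008}
```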

However, the OP wants to insert the objects in a database. So, he will most likely not need a list of keys of the flattened dictionary. He will only need a function returning the value for a given path.

This can be achieved by creating a wrapper object to the dict that has a __getitem__ method. This wrapper can also be used to return a default value for non-existing paths.

class DictWrapper:
    def __init__(self, d, sep='.'):
        self.d = d
        self.sep = sep

    def __getitem__(self, key):
        ret = self.d
        try:
            for k in key.split(self.sep):
                ret = ret[k]
            return ret
        except KeyError:
            return None

The SQL insert can then look like the following (tested with psycopg2):

for i in normalise(input):
    cur.execute('insert into mytable (year) VALUES (%(years.year)s)', DictWrapper(i))
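
To try the idea without a PostgreSQL server, the same loop can be sketched against an in-memory SQLite database from the standard library. Note that this swaps psycopg2 for sqlite3 and passes the looked-up values as ordinary positional parameters instead of handing DictWrapper to the driver; the table layout and the sample data are made up for the example.

```python
import itertools
import sqlite3

def normalise(x):
    """Expand lists (alternatives) so that each yielded object has none."""
    if isinstance(x, dict):
        keys = list(x.keys())
        values = (normalise(v) for v in x.values())
        for combo in itertools.product(*values):
            yield dict(zip(keys, combo))
    elif isinstance(x, list):
        for item in x:
            yield from normalise(item)
    else:
        yield x

class DictWrapper:
    """Look up 'a.b.c'-style paths in a nested dict; None if a key is missing."""
    def __init__(self, d, sep="."):
        self.d = d
        self.sep = sep

    def __getitem__(self, key):
        ret = self.d
        try:
            for k in key.split(self.sep):
                ret = ret[k]
            return ret
        except KeyError:
            return None

# Made-up sample input: one model with two years.
data = [{"niceName": "1-series", "years": [{"year": 2008}, {"year": 2009}]}]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (model TEXT, year INTEGER)")
for obj in normalise(data):
    row = DictWrapper(obj)
    # Resolve each path first, then bind the values positionally.
    conn.execute(
        "INSERT INTO mytable (model, year) VALUES (?, ?)",
        (row["niceName"], row["years.year"]),
    )
conn.commit()

rows = conn.execute("SELECT model, year FROM mytable ORDER BY year").fetchall()
print(rows)
# [('1-series', 2008), ('1-series', 2009)]
```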

Implementation details

This implementation clearly sacrifices some runtime performance for clarity.
Abstract base classes could be used instead of list and dict. However, this might be problematic because str is a sequence but should be treated as atom.
DictWrapper only works properly if sep is not contained in any of the dict keys.
normalise does not filter out duplicates. This could be done by using sets and named tuples instead of lists and dicts. However, this would imply that the whole result has to be in memory. It might be better to filter out duplicates at the database level.
To keep memory usage at a minimum the JSON should be read lazily.
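
As a rough illustration of the duplicate-filtering point, the sketch below converts each normalised object into a hashable form and remembers the ones already seen. The helpers freeze and unique are hypothetical names, and the approach keeps a marker for every distinct object in memory, which is exactly the trade-off noted above.

```python
def freeze(obj):
    """Return a hashable stand-in for a nested dict of atoms."""
    if isinstance(obj, dict):
        return frozenset((key, freeze(value)) for key, value in obj.items())
    return obj

def unique(objects):
    """Yield each distinct object once, in first-seen order."""
    seen = set()
    for obj in objects:
        marker = freeze(obj)
        if marker not in seen:
            seen.add(marker)
            yield obj

# Made-up rows; the second has the same content as the first.
rows = [
    {"make": {"name": "BMW"}, "year": 2008},
    {"year": 2008, "make": {"name": "BMW"}},  # same content, different key order
    {"make": {"name": "BMW"}, "year": 2009},
]
deduplicated = list(unique(rows))
print(deduplicated)
# [{'make': {'name': 'BMW'}, 'year': 2008}, {'make': {'name': 'BMW'}, 'year': 2009}]
```

For large inputs, a UNIQUE constraint or SELECT DISTINCT on the database side avoids holding the seen set in memory.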
