P3 - Data Wrangling with MongoDB

OpenStreetMap Project Data Wrangling with MongoDB

Gangadhara Naga Sai

Data used -<a href=https://mapzen.com/metro-extracts/> MapZen Weekly OpenStreetMaps Metro Extracts</a>

Map Areas: These two maps are selected since ,right now i am living at Hoodi,Bengaluru. And my dream is to do my masters in japan in robotics,so i had selected locality of University of tokyo, Bunkyo.I really wanted to explore differences between the regions.

All the files can be downloaded from:



**1. Problems Encountered**

  • Some of names were in different Languages so ,i had to filter out them and select english names for both maps Hoodi and Bunkyo
  • Street names with different types of abbreviations. (i.e. 'Clark Ave SE' or 'Eubank Northeast Ste E-18')
  • Postcodes are standardized form various formats to single format using regex
  • Two cities have to be accessed from one database

Names in Different Language

Different regions have different languages ,and we find that someof names were in different language which are filltered to get only english names. Which would check weather the charecters belong to ascii or not

In [1]:
def isEnglish(string):
    try:
        string.encode('ascii')
    except UnicodeEncodeError:
        return False
    else:
        return True

Over-­abbreviated Names

Since the most of data being manually uploaded, there are lot of abbreviations in street names,locality names. Where they are filtered and replaced with full names.

In [ ]:
#the city below can be hoodi or bunkyo
for st_type, ways in city_types.iteritems():
    for name in ways:
        better_name = update_name(name, mapping)
        if name != better_name:
            print name, "=>", better_name
In [ ]:
#few examples 
Bunkyo:
Meidai Jr. High Sch. => Meidai Junior High School
St. Mary's Cathedral => Saint Mary's Cathedral
Shinryukei brdg. E. => Shinryukei Bridge East
Iidabashi Sta. E. => Iidabashi Station East
...

Hoodi:
St. Thomas School => Saint Thomas School
Opp. Jagrithi Apartment => Opposite Jagrithi Apartment
...

Correcting Postcodes

Since various number users upload the data their might be a deviation of the format followed by each individual

In [ ]:
def clean_postal_code(postcode,city):
    post=postcode
    if city == "hoodi" and len(re.sub('[\s,+]' ,'', postcode))==6:
        if  re.match(r'^\d{6}$',postcode) :
            post=postcode
        else:
            post= re.sub('[\s,+]' ,'', postcode)
    elif city =="bunkyo" and len(re.sub('[^A-Za-z0-9]+' ,'', postcode))==7:
        if re.match(r'^([0-9]){3}[-]([0-9]){4}$',postcode):
            post=postcode
        else:
            b=re.sub('[^A-Za-z0-9]+' ,'', postcode)
            post= re.sub(r'(\d\d\d)(\d\d\d\d)', r'\1-\2', b)
        
    return  post
In [ ]:
#the above code will do parsing of postcode using regex
#few examples are given below
Bunkyo:
#Standard format of japan is NNN-NNNN(n being number)
postcode  => cleaned postcode
T123-3657 => 123-3657
123,4568 => 123-4568
1010004 => 101-0004

Hoodi:
#Standard format of japan is NNNNNN(n being number)
postcode  => cleaned postcode
5600 48 => 560048
5600,66 => 560066
560078, => 560078

From the above examples we got an idea how the postcode were cleaned

Merging Both cities

These two maps are selected since ,right now i am living at Hoodi,Bengaluru. And one day i want do my masters in japan in robotics,so i had selected locality of University of tokyo, Bunkyo.I really wanted to explore differences between the regions.

I need to add a tag named "city" so i can differentiate them from the database.


2. Data Overview

This section contains basic statistics about the dataset and the MongoDB queries used to gather them.

File Sizes

In [ ]:
bangalore.osm -40MB
bangalore.osm.json-51MB
tokyo1.osm- 82MB
tokyo1.osm.json-102.351MB

Number of documents

In [6]:
print "bunkyo:",mongo_db.cities.find({'city':'bunkyo'}).count()
print "hoodi:",mongo_db.cities.find({'city':'hoodi'}).count()
In [ ]:
bunkyo 6706454
hoodi 3339210

Number of node nodes.

In [ ]:
print "Bunkyo:",mongo_db.cities.find({"type":"node",
                                    'city':'bunkyo'}).count()
print "Hoodi:",mongo_db.cities.find({"type":"node",
                                    'city':'hoodi'}).count()
In [ ]:
bunkyo 5606240
hoodi 2744310

Number of way nodes.

In [1]:
print "bunkyo:",mongo_db.cities.find({'type':'way',
                                  'city':'bunkyo'}).count()
print "hoodi:",mongo_db.cities.find({'type':'way',
                                  'city':'hoodi'}).count()
In [1]:
bunkyo 1100214
hoodi 594900

Total Number of contributor.

In [1]:
print "Constributors:", len(mongo_db.cities.distinct("created.user"))
In [1]:
Contributors: 858

3. Additional Data Exploration using MongoDB

I am going to use the pipeline function to retrive data from the database

In [ ]:
def pipeline(city):
    p= [{"$match":{"created.user":{"$exists":1},
                                          "city":city}},
                 {"$group": {"_id": {"City":"$city",
                                     "User":"$created.user"},
                            "contribution": {"$sum": 1}}},                            
                 {"$project": {'_id':0,
                               "City":"$_id.City",
                               "User_Name":"$_id.User",
                               "Total_contribution":"$contribution"}},
                 {"$sort": {"Total_contribution": -1}},
                 {"$limit" : 5 }]
    return p
result1 =mongo_db["cities"].aggregate(pipeline('bunkyo'))
for each in result1:    
    print(each)
print("\n")
result2 =mongo_db["cities"].aggregate(pipeline('hoodi'))
for each in result2:    
    print(each)
In [ ]:
Bunkyo:
{u'City': u'bunkyo', u'User_Name': u'kurauchi', u'Total_contribution': 667425}
{u'City': u'bunkyo', u'User_Name': u'watao', u'Total_contribution': 216855}
{u'City': u'bunkyo', u'User_Name': u'higa4', u'Total_contribution': 40845}
{u'City': u'bunkyo', u'User_Name': u'ikiya', u'Total_contribution': 37287}
{u'City': u'bunkyo', u'User_Name': u'javbw', u'Total_contribution': 26655}

Hoodi:
{u'City': u'hoodi', u'User_Name': u'praveeng', u'Total_contribution': 75153}
{u'City': u'hoodi', u'User_Name': u'akhilsai', u'Total_contribution': 70509}
{u'City': u'hoodi', u'User_Name': u'anthony1', u'Total_contribution': 52194}
{u'City': u'hoodi', u'User_Name': u'anushapyata', u'Total_contribution': 45540}
{u'City': u'hoodi', u'User_Name': u'docaneesh', u'Total_contribution': 38703}

The top contributors for hoodi are no where near since bunkyo being a more compact region than hoodi ,there are more places to contribute.


To get the top Amenities in Hoodi and Bunkyo

I will be showing the pipeline that will go in the above mentioned "Pipleline" function

In [ ]:
pipeline=[{"$match":{"Additional Information.amenity":{"$exists":1},
                                          "city":city}},
                 {"$group": {"_id": {"City":"$city",
                                    "Amenity":"$Additional Information.amenity"},
                            "count": {"$sum": 1}}},
                 {"$project": {'_id':0,
                               "City":"$_id.City",
                               "Amenity":"$_id.Amenity",
                               "Count":"$count"}},
                 {"$sort": {"Count": -1}},
                 {"$limit" : 10 }]
In [ ]:
Bunkyo:
{u'Count': 1587, u'City': u'bunkyo', u'Amenity': u'parking'}
{u'Count': 1497, u'City': u'bunkyo', u'Amenity': u'restaurant'}
{u'Count': 933, u'City': u'bunkyo', u'Amenity': u'cafe'}
{u'Count': 792, u'City': u'bunkyo', u'Amenity': u'fast_food'}
{u'Count': 723, u'City': u'bunkyo', u'Amenity': u'school'}
{u'Count': 606, u'City': u'bunkyo', u'Amenity': u'place_of_worship'}
{u'Count': 525, u'City': u'bunkyo', u'Amenity': u'vending_machine'}
{u'Count': 507, u'City': u'bunkyo', u'Amenity': u'bench'}
{u'Count': 354, u'City': u'bunkyo', u'Amenity': u'pub'}
{u'Count': 342, u'City': u'bunkyo', u'Amenity': u'kindergarten'}

Hoodi:
{u'Count': 528, u'City': u'hoodi', u'Amenity': u'restaurant'}
{u'Count': 216, u'City': u'hoodi', u'Amenity': u'school'}
{u'Count': 189, u'City': u'hoodi', u'Amenity': u'atm'}
{u'Count': 162, u'City': u'hoodi', u'Amenity': u'pharmacy'}
{u'Count': 162, u'City': u'hoodi', u'Amenity': u'parking'}
{u'Count': 162, u'City': u'hoodi', u'Amenity': u'place_of_worship'}
{u'Count': 141, u'City': u'hoodi', u'Amenity': u'bank'}
{u'Count': 141, u'City': u'hoodi', u'Amenity': u'fast_food'}
{u'Count': 138, u'City': u'hoodi', u'Amenity': u'hospital'}
{u'Count': 102, u'City': u'hoodi', u'Amenity': u'cafe'}

As compared to hoodi ,bunkyo have few atms,And parking can be commonly found in bunkyo locality


In [ ]:
    p = [{"$match":{"Additional Information.amenity":{"$exists":1},
                                  "Additional Information.amenity":"place_of_worship",
                                  "city":city}},
                       {"$group":{"_id": {"City":"$city",
                                          "Religion":"$Additional Information.religion"},
                                  "count":{"$sum":1}}},
                       {"$project":{"_id":0,
                                    "City":"$_id.City",
                                    "Religion":"$_id.Religion",
                                    "Count":"$count"}},
                       {"$sort":{"Count":-1}},
                       {"$limit":6}]
In [ ]:
Bunkyo: 
{u'Count': 303, u'City': u'bunkyo', u'Religion': u'buddhist'}
{u'Count': 132, u'City': u'bunkyo', u'Religion': u'shinto'}
{u'Count': 123, u'City': u'bunkyo'}
{u'Count': 39, u'City': u'bunkyo', u'Religion': u'christian'}
{u'Count': 3, u'City': u'bunkyo', u'Religion': u'muslim'}
{u'Count': 3, u'City': u'bunkyo', u'Religion': u'confucian'}

Hoodi:
{u'Count': 90, u'City': u'hoodi', u'Religion': u'hindu'}
{u'Count': 30, u'City': u'hoodi', u'Religion': u'christian'}
{u'Count': 24, u'City': u'hoodi'}
{u'Count': 18, u'City': u'hoodi', u'Religion': u'muslim'}

As expected japan is popular with buddism,

but india being a secular country it will be having most of the reglious places of worship,where hinduism being majority


In [ ]:
p = [{"$match":{"Additional Information.amenity":{"$exists":1},
                                 "Additional Information.amenity":"restaurant",
                                 "city":city}},
                      {"$group":{"_id":{"City":"$city",
                                        "Food":"$Additional Information.cuisine"},
                                 "count":{"$sum":1}}},
                      {"$project":{"_id":0,
                                  "City":"$_id.City",
                                  "Food":"$_id.Food",
                                  "Count":"$count"}},
                      {"$sort":{"Count":-1}}, 
                      {"$limit":6}]
In [ ]:
{u'Count': 582, u'City': u'bunkyo'}
{u'Food': u'japanese', u'City': u'bunkyo', u'Count': 192}
{u'Food': u'chinese', u'City': u'bunkyo', u'Count': 126}
{u'Food': u'italian', u'City': u'bunkyo', u'Count': 69}
{u'Food': u'indian', u'City': u'bunkyo', u'Count': 63}
{u'Food': u'sushi', u'City': u'bunkyo', u'Count': 63}


{u'Count': 213, u'City': u'hoodi'}
{u'Food': u'regional', u'City': u'hoodi', u'Count': 75}
{u'Food': u'indian', u'City': u'hoodi', u'Count': 69}
{u'Food': u'chinese', u'City': u'hoodi', u'Count': 36}
{u'Food': u'international', u'City': u'hoodi', u'Count': 24}
{u'Food': u'Andhra', u'City': u'hoodi', u'Count': 21}

Indian style cusine in Bunkyo seems famous, Which will be better if i go to japan and do my higher studies there.


In [ ]:
 p = [{"$match":{"Additional Information.amenity":{"$exists":1},
                                 "Additional Information.amenity":"fast_food",
                                 "city":city}},
                      {"$group":{"_id":{"City":"$city",
                                        "Food":"$Additional Information.cuisine"},
                                 "count":{"$sum":1}}},
                      {"$project":{"_id":0,
                                  "City":"$_id.City",
                                  "Food":"$_id.Food",
                                  "Count":"$count"}},
                      {"$sort":{"Count":-1}}, 
                      {"$limit":6}]
In [ ]:
{u'Count': 246, u'City': u'bunkyo'}
{u'Food': u'burger', u'City': u'bunkyo', u'Count': 102}
{u'Food': u'ramen', u'City': u'bunkyo', u'Count': 81}
{u'Food': u'japanese', u'City': u'bunkyo', u'Count': 54}
{u'Food': u'noodle', u'City': u'bunkyo', u'Count': 51}
{u'Food': u'noodle;ramen', u'City': u'bunkyo', u'Count': 33}


{u'Count': 66, u'City': u'hoodi'}
{u'Food': u'pizza', u'City': u'hoodi', u'Count': 21}
{u'Food': u'indian', u'City': u'hoodi', u'Count': 12}
{u'Food': u'chicken', u'City': u'hoodi', u'Count': 6}
{u'Food': u'ice_cream', u'City': u'hoodi', u'Count': 6}
{u'Food': u'burger', u'City': u'hoodi', u'Count': 6}

Burger seems very popular among japanese in fast foods,i was expecting ramen to be more popular

, but in hoodi pizza is really common,being a metropolitan city.


ATM's near locality

In [ ]:
    p = [{"$match":{"Additional Information.amenity":{"$exists":1},
                                 "Additional Information.amenity":"atm",
                                 "city":city}},
                      {"$group":{"_id":{"City":"$city",
                                        "Name":"$Additional Information.name:en"},
                                 "count":{"$sum":1}}},
                      {"$project":{"_id":0,
                                  "City":"$_id.City",
                                  "Name":"$_id.Name",
                                  "Count":"$count"}},
                      {"$sort":{"Count":-1}}, 
                      {"$limit":4}]
In [ ]:
{u'Count': 75, u'City': u'bunkyo'}
{u'Count': 6, u'City': u'bunkyo', u'Name': u'JP Bank'}
{u'Count': 6, u'City': u'bunkyo', u'Name': u'Mizuho'}


{u'Count': 177, u'City': u'hoodi'}
{u'Count': 3, u'City': u'hoodi', u'Name': u'Axis bank ATM'}
{u'Count': 3, u'City': u'hoodi', u'Name': u'HDFC'}
{u'Count': 3, u'City': u'hoodi', u'Name': u'Canara'}
{u'Count': 3, u'City': u'hoodi', u'Name': u'State bank ATM'}

There are quite a few ATM in Bunkyo as compared to hoodi


Martial arts or Dojo Center near locality

In [ ]:
##  Martial arts or Dojo Center near locality
import re

pat = re.compile(r'dojo', re.I)
d=mongo_db.cities.aggregate([{"$match":{ "$or": [ { "Additional Information.name": {'$regex': pat}}
                                                   ,{"Additional Information.amenity": {'$regex': pat}}]}}
                            ,{"$group":{"_id":{"City":"$city"
                             , "Sport":"$Additional Information.name"}}}])
for each in d:    
    print(each)
In [ ]:
bunkyo: 
{u'_id': {u'City': u'bunkyo', u'Sport': u'Aikikai Hombu Dojo'}}
{u'_id': {u'City': u'bunkyo', u'Sport': u'Kodokan Dojo'}}

hoodi:
{u'_id': {u'City': u'hoodi', u'Sport': u"M S Gurukkal's Kalari Academy"}}

I wanted to learn martial arts , In japan is known for its akido and other ninjistsu martial arts , where i can find some in bunkyo Where as in hoodi,india Kalaripayattu Martial Arts are one of the ancient arts that ever existed.


In [ ]:
 p = [{"$match":{"Additional Information.shop":{"$exists":1},
                                          "city":city}},
                       {"$group":{"_id":{"City":"$city",
                                  "Shop":"$Additional Information.shop"},
                           "count":{"$sum":1}}},
                       {"$project": {'_id':0,
                                     "City":"$_id.City",
                                     "Shop":"$_id.Shop",
                                     "Count":"$count"}},
                       {"$sort":{"Count":-1}},
                       {"$limit":10}]
In [ ]:
{u'Shop': u'convenience', u'City': u'bunkyo', u'Count': 1035}
{u'Shop': u'clothes', u'City': u'bunkyo', u'Count': 282}
{u'Shop': u'books', u'City': u'bunkyo', u'Count': 225}
{u'Shop': u'mobile_phone', u'City': u'bunkyo', u'Count': 186}
{u'Shop': u'confectionery', u'City': u'bunkyo', u'Count': 156}
{u'Shop': u'supermarket', u'City': u'bunkyo', u'Count': 150}
{u'Shop': u'computer', u'City': u'bunkyo', u'Count': 126}
{u'Shop': u'hairdresser', u'City': u'bunkyo', u'Count': 90}
{u'Shop': u'electronics', u'City': u'bunkyo', u'Count': 90}
{u'Shop': u'anime', u'City': u'bunkyo', u'Count': 90}


{u'Shop': u'clothes', u'City': u'hoodi', u'Count': 342}
{u'Shop': u'supermarket', u'City': u'hoodi', u'Count': 129}
{u'Shop': u'bakery', u'City': u'hoodi', u'Count': 120}
{u'Shop': u'shoes', u'City': u'hoodi', u'Count': 72}
{u'Shop': u'furniture', u'City': u'hoodi', u'Count': 72}
{u'Shop': u'sports', u'City': u'hoodi', u'Count': 66}
{u'Shop': u'electronics', u'City': u'hoodi', u'Count': 60}
{u'Shop': u'beauty', u'City': u'hoodi', u'Count': 54}
{u'Shop': u'car', u'City': u'hoodi', u'Count': 36}
{u'Shop': u'convenience', u'City': u'hoodi', u'Count': 36}

The general stores are quite common in both the places

In [ ]:
    p = [{"$match":{"Additional Information.shop":{"$exists":1},
                           "city":city,
                           "Additional Information.shop":"supermarket"}},
                       {"$group":{"_id":{"City":"$city",
                                  "Supermarket":"$Additional Information.name"},
                           "count":{"$sum":1}}},
                       {"$project": {'_id':0,
                                     "City":"$_id.City",
                                     "Supermarket":"$_id.Supermarket",
                                     "Count":"$count"}},
                       {"$sort":{"Count":-1}},
                       {"$limit":5}]
In [ ]:
{u'Count': 120, u'City': u'bunkyo'}
{u'Count': 9, u'City': u'bunkyo', u'Supermarket': u'Maruetsu'}
{u'Count': 3, u'City': u'bunkyo', u'Supermarket': u"Y's Mart"}
{u'Count': 3, u'City': u'bunkyo', u'Supermarket': u'SainE'}
{u'Count': 3, u'City': u'bunkyo', u'Supermarket': u'DAIMARU Peacock'}


{u'Count': 9, u'City': u'hoodi', u'Supermarket': u'Reliance Fresh'}
{u'Count': 9, u'City': u'hoodi'}
{u'Count': 6, u'City': u'hoodi', u'Supermarket': u"Nilgiri's"}
{u'Count': 3, u'City': u'hoodi', u'Supermarket': u'Royal Mart Supermarket'}
{u'Count': 3, u'City': u'hoodi', u'Supermarket': u'Safal'}

These are few common supermarket brands in both the cities And Nilgiris is like 500 meters away from my home.


4. Conclusion

After such a investigation on this data i think i have become familiar with bunkyo region.

I was expecting a difficulty in merging both the cities data into a single database ,but seem a simple key like city could differentiate them.

There might be even robust cleaning algorithms to a better and clean database,as most of the data is from gps that goes into OpenStreetMap.org. Which needed to be regularly cleaned.

Improvising the data set:

I have observed due to language diferences there is lot more information that is not accessable, So there must be a integration between google translate and openstreetmaps to translate the uploaded information to universal language like english.More processesing power would be required to translate all the data, and a dedicted team is required to trouble shoot if any prolem occurs.

We can use social networking as people are obsessed with uploading pictures in facebook ot twitter they often tag the location where they are ,and these information should be integrated with openstreet maps , where we can draw intresting conlusions.

A standardized format should be followed by every user at the time of uploading the data, Where for example postcodes are having a standard format for their respective countries , so at the time of uploading itself openstreet map should ask for which country(like a drop down menu) and accept only if postcode is put in format.

If the data is collected from several apps that collect data through gps before entering data into openstreetmap a parser should check the spelling of non name entries. We can also use machine learning techniques to learn the pattern and select a standard pattern and parse the data. It would be diffcult to build a model for the huge data available, but the more data available the model can effectively generalize the data

From the comparision of both the cities these are qiute similar and bunkyo region interests me even more to pursue higher studies.