Data used -<a href=https://mapzen.com/metro-extracts/> MapZen Weekly OpenStreetMaps Metro Extracts</a>
Map Areas: These two maps are selected since ,right now i am living at Hoodi,Bengaluru. And my dream is to do my masters in japan in robotics,so i had selected locality of University of tokyo, Bunkyo.I really wanted to explore differences between the regions.
All the files can be downloaded from:
Different regions have different languages ,and we find that someof names were in different language which are filltered to get only english names. Which would check weather the charecters belong to ascii or not
def isEnglish(string):
try:
string.encode('ascii')
except UnicodeEncodeError:
return False
else:
return True
Since the most of data being manually uploaded, there are lot of abbreviations in street names,locality names. Where they are filtered and replaced with full names.
#the city below can be hoodi or bunkyo
for st_type, ways in city_types.iteritems():
for name in ways:
better_name = update_name(name, mapping)
if name != better_name:
print name, "=>", better_name
#few examples
Bunkyo:
Meidai Jr. High Sch. => Meidai Junior High School
St. Mary's Cathedral => Saint Mary's Cathedral
Shinryukei brdg. E. => Shinryukei Bridge East
Iidabashi Sta. E. => Iidabashi Station East
...
Hoodi:
St. Thomas School => Saint Thomas School
Opp. Jagrithi Apartment => Opposite Jagrithi Apartment
...
Since various number users upload the data their might be a deviation of the format followed by each individual
def clean_postal_code(postcode,city):
post=postcode
if city == "hoodi" and len(re.sub('[\s,+]' ,'', postcode))==6:
if re.match(r'^\d{6}$',postcode) :
post=postcode
else:
post= re.sub('[\s,+]' ,'', postcode)
elif city =="bunkyo" and len(re.sub('[^A-Za-z0-9]+' ,'', postcode))==7:
if re.match(r'^([0-9]){3}[-]([0-9]){4}$',postcode):
post=postcode
else:
b=re.sub('[^A-Za-z0-9]+' ,'', postcode)
post= re.sub(r'(\d\d\d)(\d\d\d\d)', r'\1-\2', b)
return post
#the above code will do parsing of postcode using regex
#few examples are given below
Bunkyo:
#Standard format of japan is NNN-NNNN(n being number)
postcode => cleaned postcode
T123-3657 => 123-3657
123,4568 => 123-4568
1010004 => 101-0004
Hoodi:
#Standard format of japan is NNNNNN(n being number)
postcode => cleaned postcode
5600 48 => 560048
5600,66 => 560066
560078, => 560078
From the above examples we got an idea how the postcode were cleaned
These two maps are selected since ,right now i am living at Hoodi,Bengaluru. And one day i want do my masters in japan in robotics,so i had selected locality of University of tokyo, Bunkyo.I really wanted to explore differences between the regions.
I need to add a tag named "city" so i can differentiate them from the database.
This section contains basic statistics about the dataset and the MongoDB queries used to gather them.
bangalore.osm -40MB
bangalore.osm.json-51MB
tokyo1.osm- 82MB
tokyo1.osm.json-102.351MB
print "bunkyo:",mongo_db.cities.find({'city':'bunkyo'}).count()
print "hoodi:",mongo_db.cities.find({'city':'hoodi'}).count()
bunkyo 6706454
hoodi 3339210
print "Bunkyo:",mongo_db.cities.find({"type":"node",
'city':'bunkyo'}).count()
print "Hoodi:",mongo_db.cities.find({"type":"node",
'city':'hoodi'}).count()
bunkyo 5606240
hoodi 2744310
print "bunkyo:",mongo_db.cities.find({'type':'way',
'city':'bunkyo'}).count()
print "hoodi:",mongo_db.cities.find({'type':'way',
'city':'hoodi'}).count()
bunkyo 1100214
hoodi 594900
print "Constributors:", len(mongo_db.cities.distinct("created.user"))
Contributors: 858
I am going to use the pipeline function to retrive data from the database
def pipeline(city):
p= [{"$match":{"created.user":{"$exists":1},
"city":city}},
{"$group": {"_id": {"City":"$city",
"User":"$created.user"},
"contribution": {"$sum": 1}}},
{"$project": {'_id':0,
"City":"$_id.City",
"User_Name":"$_id.User",
"Total_contribution":"$contribution"}},
{"$sort": {"Total_contribution": -1}},
{"$limit" : 5 }]
return p
result1 =mongo_db["cities"].aggregate(pipeline('bunkyo'))
for each in result1:
print(each)
print("\n")
result2 =mongo_db["cities"].aggregate(pipeline('hoodi'))
for each in result2:
print(each)
Bunkyo:
{u'City': u'bunkyo', u'User_Name': u'kurauchi', u'Total_contribution': 667425}
{u'City': u'bunkyo', u'User_Name': u'watao', u'Total_contribution': 216855}
{u'City': u'bunkyo', u'User_Name': u'higa4', u'Total_contribution': 40845}
{u'City': u'bunkyo', u'User_Name': u'ikiya', u'Total_contribution': 37287}
{u'City': u'bunkyo', u'User_Name': u'javbw', u'Total_contribution': 26655}
Hoodi:
{u'City': u'hoodi', u'User_Name': u'praveeng', u'Total_contribution': 75153}
{u'City': u'hoodi', u'User_Name': u'akhilsai', u'Total_contribution': 70509}
{u'City': u'hoodi', u'User_Name': u'anthony1', u'Total_contribution': 52194}
{u'City': u'hoodi', u'User_Name': u'anushapyata', u'Total_contribution': 45540}
{u'City': u'hoodi', u'User_Name': u'docaneesh', u'Total_contribution': 38703}
The top contributors for hoodi are no where near since bunkyo being a more compact region than hoodi ,there are more places to contribute.
I will be showing the pipeline that will go in the above mentioned "Pipleline" function
pipeline=[{"$match":{"Additional Information.amenity":{"$exists":1},
"city":city}},
{"$group": {"_id": {"City":"$city",
"Amenity":"$Additional Information.amenity"},
"count": {"$sum": 1}}},
{"$project": {'_id':0,
"City":"$_id.City",
"Amenity":"$_id.Amenity",
"Count":"$count"}},
{"$sort": {"Count": -1}},
{"$limit" : 10 }]
Bunkyo:
{u'Count': 1587, u'City': u'bunkyo', u'Amenity': u'parking'}
{u'Count': 1497, u'City': u'bunkyo', u'Amenity': u'restaurant'}
{u'Count': 933, u'City': u'bunkyo', u'Amenity': u'cafe'}
{u'Count': 792, u'City': u'bunkyo', u'Amenity': u'fast_food'}
{u'Count': 723, u'City': u'bunkyo', u'Amenity': u'school'}
{u'Count': 606, u'City': u'bunkyo', u'Amenity': u'place_of_worship'}
{u'Count': 525, u'City': u'bunkyo', u'Amenity': u'vending_machine'}
{u'Count': 507, u'City': u'bunkyo', u'Amenity': u'bench'}
{u'Count': 354, u'City': u'bunkyo', u'Amenity': u'pub'}
{u'Count': 342, u'City': u'bunkyo', u'Amenity': u'kindergarten'}
Hoodi:
{u'Count': 528, u'City': u'hoodi', u'Amenity': u'restaurant'}
{u'Count': 216, u'City': u'hoodi', u'Amenity': u'school'}
{u'Count': 189, u'City': u'hoodi', u'Amenity': u'atm'}
{u'Count': 162, u'City': u'hoodi', u'Amenity': u'pharmacy'}
{u'Count': 162, u'City': u'hoodi', u'Amenity': u'parking'}
{u'Count': 162, u'City': u'hoodi', u'Amenity': u'place_of_worship'}
{u'Count': 141, u'City': u'hoodi', u'Amenity': u'bank'}
{u'Count': 141, u'City': u'hoodi', u'Amenity': u'fast_food'}
{u'Count': 138, u'City': u'hoodi', u'Amenity': u'hospital'}
{u'Count': 102, u'City': u'hoodi', u'Amenity': u'cafe'}
p = [{"$match":{"Additional Information.amenity":{"$exists":1},
"Additional Information.amenity":"place_of_worship",
"city":city}},
{"$group":{"_id": {"City":"$city",
"Religion":"$Additional Information.religion"},
"count":{"$sum":1}}},
{"$project":{"_id":0,
"City":"$_id.City",
"Religion":"$_id.Religion",
"Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":6}]
Bunkyo:
{u'Count': 303, u'City': u'bunkyo', u'Religion': u'buddhist'}
{u'Count': 132, u'City': u'bunkyo', u'Religion': u'shinto'}
{u'Count': 123, u'City': u'bunkyo'}
{u'Count': 39, u'City': u'bunkyo', u'Religion': u'christian'}
{u'Count': 3, u'City': u'bunkyo', u'Religion': u'muslim'}
{u'Count': 3, u'City': u'bunkyo', u'Religion': u'confucian'}
Hoodi:
{u'Count': 90, u'City': u'hoodi', u'Religion': u'hindu'}
{u'Count': 30, u'City': u'hoodi', u'Religion': u'christian'}
{u'Count': 24, u'City': u'hoodi'}
{u'Count': 18, u'City': u'hoodi', u'Religion': u'muslim'}
As expected japan is popular with buddism,
but india being a secular country it will be having most of the reglious places of worship,where hinduism being majority
p = [{"$match":{"Additional Information.amenity":{"$exists":1},
"Additional Information.amenity":"restaurant",
"city":city}},
{"$group":{"_id":{"City":"$city",
"Food":"$Additional Information.cuisine"},
"count":{"$sum":1}}},
{"$project":{"_id":0,
"City":"$_id.City",
"Food":"$_id.Food",
"Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":6}]
{u'Count': 582, u'City': u'bunkyo'}
{u'Food': u'japanese', u'City': u'bunkyo', u'Count': 192}
{u'Food': u'chinese', u'City': u'bunkyo', u'Count': 126}
{u'Food': u'italian', u'City': u'bunkyo', u'Count': 69}
{u'Food': u'indian', u'City': u'bunkyo', u'Count': 63}
{u'Food': u'sushi', u'City': u'bunkyo', u'Count': 63}
{u'Count': 213, u'City': u'hoodi'}
{u'Food': u'regional', u'City': u'hoodi', u'Count': 75}
{u'Food': u'indian', u'City': u'hoodi', u'Count': 69}
{u'Food': u'chinese', u'City': u'hoodi', u'Count': 36}
{u'Food': u'international', u'City': u'hoodi', u'Count': 24}
{u'Food': u'Andhra', u'City': u'hoodi', u'Count': 21}
Indian style cusine in Bunkyo seems famous, Which will be better if i go to japan and do my higher studies there.
p = [{"$match":{"Additional Information.amenity":{"$exists":1},
"Additional Information.amenity":"fast_food",
"city":city}},
{"$group":{"_id":{"City":"$city",
"Food":"$Additional Information.cuisine"},
"count":{"$sum":1}}},
{"$project":{"_id":0,
"City":"$_id.City",
"Food":"$_id.Food",
"Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":6}]
{u'Count': 246, u'City': u'bunkyo'}
{u'Food': u'burger', u'City': u'bunkyo', u'Count': 102}
{u'Food': u'ramen', u'City': u'bunkyo', u'Count': 81}
{u'Food': u'japanese', u'City': u'bunkyo', u'Count': 54}
{u'Food': u'noodle', u'City': u'bunkyo', u'Count': 51}
{u'Food': u'noodle;ramen', u'City': u'bunkyo', u'Count': 33}
{u'Count': 66, u'City': u'hoodi'}
{u'Food': u'pizza', u'City': u'hoodi', u'Count': 21}
{u'Food': u'indian', u'City': u'hoodi', u'Count': 12}
{u'Food': u'chicken', u'City': u'hoodi', u'Count': 6}
{u'Food': u'ice_cream', u'City': u'hoodi', u'Count': 6}
{u'Food': u'burger', u'City': u'hoodi', u'Count': 6}
Burger seems very popular among japanese in fast foods,i was expecting ramen to be more popular
, but in hoodi pizza is really common,being a metropolitan city.
p = [{"$match":{"Additional Information.amenity":{"$exists":1},
"Additional Information.amenity":"atm",
"city":city}},
{"$group":{"_id":{"City":"$city",
"Name":"$Additional Information.name:en"},
"count":{"$sum":1}}},
{"$project":{"_id":0,
"City":"$_id.City",
"Name":"$_id.Name",
"Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":4}]
{u'Count': 75, u'City': u'bunkyo'}
{u'Count': 6, u'City': u'bunkyo', u'Name': u'JP Bank'}
{u'Count': 6, u'City': u'bunkyo', u'Name': u'Mizuho'}
{u'Count': 177, u'City': u'hoodi'}
{u'Count': 3, u'City': u'hoodi', u'Name': u'Axis bank ATM'}
{u'Count': 3, u'City': u'hoodi', u'Name': u'HDFC'}
{u'Count': 3, u'City': u'hoodi', u'Name': u'Canara'}
{u'Count': 3, u'City': u'hoodi', u'Name': u'State bank ATM'}
There are quite a few ATM in Bunkyo as compared to hoodi
## Martial arts or Dojo Center near locality
import re
pat = re.compile(r'dojo', re.I)
d=mongo_db.cities.aggregate([{"$match":{ "$or": [ { "Additional Information.name": {'$regex': pat}}
,{"Additional Information.amenity": {'$regex': pat}}]}}
,{"$group":{"_id":{"City":"$city"
, "Sport":"$Additional Information.name"}}}])
for each in d:
print(each)
bunkyo:
{u'_id': {u'City': u'bunkyo', u'Sport': u'Aikikai Hombu Dojo'}}
{u'_id': {u'City': u'bunkyo', u'Sport': u'Kodokan Dojo'}}
hoodi:
{u'_id': {u'City': u'hoodi', u'Sport': u"M S Gurukkal's Kalari Academy"}}
I wanted to learn martial arts , In japan is known for its akido and other ninjistsu martial arts , where i can find some in bunkyo Where as in hoodi,india Kalaripayattu Martial Arts are one of the ancient arts that ever existed.
p = [{"$match":{"Additional Information.shop":{"$exists":1},
"city":city}},
{"$group":{"_id":{"City":"$city",
"Shop":"$Additional Information.shop"},
"count":{"$sum":1}}},
{"$project": {'_id':0,
"City":"$_id.City",
"Shop":"$_id.Shop",
"Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":10}]
{u'Shop': u'convenience', u'City': u'bunkyo', u'Count': 1035}
{u'Shop': u'clothes', u'City': u'bunkyo', u'Count': 282}
{u'Shop': u'books', u'City': u'bunkyo', u'Count': 225}
{u'Shop': u'mobile_phone', u'City': u'bunkyo', u'Count': 186}
{u'Shop': u'confectionery', u'City': u'bunkyo', u'Count': 156}
{u'Shop': u'supermarket', u'City': u'bunkyo', u'Count': 150}
{u'Shop': u'computer', u'City': u'bunkyo', u'Count': 126}
{u'Shop': u'hairdresser', u'City': u'bunkyo', u'Count': 90}
{u'Shop': u'electronics', u'City': u'bunkyo', u'Count': 90}
{u'Shop': u'anime', u'City': u'bunkyo', u'Count': 90}
{u'Shop': u'clothes', u'City': u'hoodi', u'Count': 342}
{u'Shop': u'supermarket', u'City': u'hoodi', u'Count': 129}
{u'Shop': u'bakery', u'City': u'hoodi', u'Count': 120}
{u'Shop': u'shoes', u'City': u'hoodi', u'Count': 72}
{u'Shop': u'furniture', u'City': u'hoodi', u'Count': 72}
{u'Shop': u'sports', u'City': u'hoodi', u'Count': 66}
{u'Shop': u'electronics', u'City': u'hoodi', u'Count': 60}
{u'Shop': u'beauty', u'City': u'hoodi', u'Count': 54}
{u'Shop': u'car', u'City': u'hoodi', u'Count': 36}
{u'Shop': u'convenience', u'City': u'hoodi', u'Count': 36}
The general stores are quite common in both the places
p = [{"$match":{"Additional Information.shop":{"$exists":1},
"city":city,
"Additional Information.shop":"supermarket"}},
{"$group":{"_id":{"City":"$city",
"Supermarket":"$Additional Information.name"},
"count":{"$sum":1}}},
{"$project": {'_id':0,
"City":"$_id.City",
"Supermarket":"$_id.Supermarket",
"Count":"$count"}},
{"$sort":{"Count":-1}},
{"$limit":5}]
{u'Count': 120, u'City': u'bunkyo'}
{u'Count': 9, u'City': u'bunkyo', u'Supermarket': u'Maruetsu'}
{u'Count': 3, u'City': u'bunkyo', u'Supermarket': u"Y's Mart"}
{u'Count': 3, u'City': u'bunkyo', u'Supermarket': u'SainE'}
{u'Count': 3, u'City': u'bunkyo', u'Supermarket': u'DAIMARU Peacock'}
{u'Count': 9, u'City': u'hoodi', u'Supermarket': u'Reliance Fresh'}
{u'Count': 9, u'City': u'hoodi'}
{u'Count': 6, u'City': u'hoodi', u'Supermarket': u"Nilgiri's"}
{u'Count': 3, u'City': u'hoodi', u'Supermarket': u'Royal Mart Supermarket'}
{u'Count': 3, u'City': u'hoodi', u'Supermarket': u'Safal'}
These are few common supermarket brands in both the cities And Nilgiris is like 500 meters away from my home.
After such a investigation on this data i think i have become familiar with bunkyo region.
I was expecting a difficulty in merging both the cities data into a single database ,but seem a simple key like city could differentiate them.
There might be even robust cleaning algorithms to a better and clean database,as most of the data is from gps that goes into OpenStreetMap.org. Which needed to be regularly cleaned.
I have observed due to language diferences there is lot more information that is not accessable, So there must be a integration between google translate and openstreetmaps to translate the uploaded information to universal language like english.More processesing power would be required to translate all the data, and a dedicted team is required to trouble shoot if any prolem occurs.
We can use social networking as people are obsessed with uploading pictures in facebook ot twitter they often tag the location where they are ,and these information should be integrated with openstreet maps , where we can draw intresting conlusions.
A standardized format should be followed by every user at the time of uploading the data, Where for example postcodes are having a standard format for their respective countries , so at the time of uploading itself openstreet map should ask for which country(like a drop down menu) and accept only if postcode is put in format.
If the data is collected from several apps that collect data through gps before entering data into openstreetmap a parser should check the spelling of non name entries. We can also use machine learning techniques to learn the pattern and select a standard pattern and parse the data. It would be diffcult to build a model for the huge data available, but the more data available the model can effectively generalize the data
From the comparision of both the cities these are qiute similar and bunkyo region interests me even more to pursue higher studies.