numpy - Similarity Measure/Matrix for data (recommender system) - Python

- February 15, 2015

I am new to machine learning and am trying out the following problem. The input is 2 arrays of descriptions of the same length, and the output is an array of similarity scores: the first string of the first array compared with the first string of the second array, and so on.

Each item in an array (a NumPy array) is a string holding a description. Can I write a function that finds out how similar 2 strings are by calculating how many identical and co-occurring word IDs there are, and assigns a score (one possible weight can be based on the frequency of co-occurrence vs. the sum of the frequencies of the individual word IDs)? I would then apply the function to the 2 arrays to get an array of scores. Please let me know if there are other approaches I should consider as well. Thanks!

data:

array(['0/1/2/3/4/5/6/5/7/8/9/3/10', '11/12/13/14/15/15/16/17/12',        '18/19/20/21/22/23/24/25',        '26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41',        '5/42/43/15/44/45/46/47/48/26/49/50/51/52/49/53/54/51/55/56/22',        '57/58/59/60/61/49/62/23/57/58/63/57/58', '64/65/66/63/67/68/69',        '70/71/72/73/74/75/76/77',        '78/79/80/81/82/83/84/85/86/87/88/89/90/91',        '33/34/92/93/94/95/85/96/97/98/99/60/85/86/100/101/102/103',        '104/105/106/107/108/109/110/86/107/111/112/113/5/114/110/113/115/116',        '117/118/119/120/121/12/122/123/124/125',        '14/126/127/128/122/129/130/131/132/29/54/29/129/49/3/133/134/135/136',        '137/138/139/140/141/142',        '143/144/145/146/147/148/149/150/151/152/4/153/154/155/156/157/158/128/159',        '160/161/162/163/131/2/164/165/166/167/168/169/49/170/109/171',        '172/173/174/175/176/177/73/178/104/179/180/179/181/173',        '182/144/183/179/73',        '184/163/68/185/163/8/186/187/188/54/189/190/191',        '181/192/0/1/193/194/22/195',        '113/196/197/198/68/199/68/200/201/202/203/201',        '204/205/206/207/208/209/68/200',        '163/210/211/122/212/213/214/215/216/217/100/101/160/139/218/76/179/219',        '220/221/222/223/5/68/224/225/54/225/226/227/5/221/222/223',        '214/228/5/6/5/215/228/228/229',        '230/231/232/233/122/215/128/214/128/234/234',        '235/236/191/237/92/93/238/239',        '13/14/44/44/240/241/242/49/54/243/244/245/55/56',        '220/21/246/38/247/201/248/73/160/249/250/203/201',        '214/49/251/252/253/254/255/256/257/258'],        dtype='|s127')  array(['151/308/309/310/311/215/312/160/313/214/49/12',        '314/315/316/275/317/42/318/319/320/212/49/170/179/29/54/29/321/322/323',        '324/325/62/220/326/194/327/328/218/76/241/329',        '330/29/22/103/331/314/68/80/49',        '78/332/85/96/97/227/333/4/334/188',        '57/335/336/34/187/337/21/338/212/213/339/340',        '341/342/167/343/8/254/154/61/344',      
  '2/292/345/346/42/347/348/348/100/349/202/161/263',        '283/39/312/350/26/351', '352/353/33/34/144/218/73/354/355',        '137/356/357/358/357/359/22/73/170/87/88/78/123/360/361/53/362',        '23/363/10/364/289/68/123/354/355',        '188/28/365/149/366/98/367/368/369/370/371/372/368',        '373/155/33/34/374/25/113/73', '104/375/81/82/168/169/81/82/18/19',        '179/376/377/378/179/87/88/379/20',        '380/85/381/333/382/215/128/383/384', '385/129/386/387/388',        '389/280/26/27/390/391/302/392/393/165/394/254/302/214/217/395/396',        '397/398/291/140/399/211/158/27/400', '401/402/92/93/68/80',        '77/129/183/265/403/404/405/406/60/407/162/408/409/410/411/412/413/156',        '129/295/90/259/38/39/119/414/415/416/14/318/417/418',        '419/420/421/422/423/23/424/241/421/425/58',        '426/244/427/5/428/49/76/429/430/431',        '257/432/433/167/100/101/434/435/436', '437/167/438/344/356/170',        '439/440/441/442/192/443/68/80/444/445/111', '446/312/23/447/448',        '385/129/218/449/450/451/22/452/125/129/453/212/128/454/455/456/457/377'],        dtype='|s127')

The following code should do what you need in Python 3.x:

import numpy as np
from collections import Counter

def jaccardsim(c1, c2):
    # Multiset union (per-ID max count) and intersection (per-ID min count)
    cu = c1 | c2
    ci = c1 & c2
    sim = sum(ci.values()) / sum(cu.values())
    return sim

def bytearraysim(b1, b2):
    # Decode each byte string and count its "/"-separated word IDs
    ca = [Counter(b1[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b1))]
    cb = [Counter(b2[i].decode(encoding="utf-8", errors="strict").split("/"))
          for i in range(len(b2))]

    # Assuming both b1 and b2 have the same length
    csim = [jaccardsim(ca[i], cb[i]) for i in range(len(b1))]

    return csim  # list of similarities, one per pair

The Jaccard similarity score is used in this implementation. You may use other scores, such as cosine or Hamming, to your liking.
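As an illustration of swapping in another score, the cosine similarity could be computed on the same kind of Counter bags. This is only a minimal sketch: cosine_sim is a hypothetical helper name that is not part of the code above, and it assumes each description has already been split into word IDs.

```python
from collections import Counter
from math import sqrt

def cosine_sim(c1, c2):
    """Cosine similarity between two bags of word IDs (hypothetical helper)."""
    # Dot product only needs the IDs that occur in both bags.
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot / (norm1 * norm2)

# Made-up example strings in the same "/"-separated format as the data
x = Counter("5/42/43/15".split("/"))
y = Counter("5/15/99".split("/"))
print(cosine_sim(x, y))
```

Unlike the Jaccard score, this one weights repeated IDs by the product of their counts, so descriptions that repeat the same shared ID many times score higher.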

Assuming the arrays are stored in variables a and b, the call bytearraysim(a, b) outputs the following similarity scores:

[0.0,  0.0,  0.0,  0.038461538461538464,  0.0,  0.041666666666666664,  0.0,  0.0,  0.0,  0.08,  0.0,  0.05555555555555555,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  0.058823529411764705,  0.0,  0.0,  0.0,  0.05555555555555555,  0.0,  0.0,  0.0,  0.0,  0.0]
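To see where one of these numbers comes from, the fourth pair (index 3) can be worked through by hand. The sketch below uses plain str values instead of the byte strings in the arrays, purely to keep the example short; the variable names are illustrative.

```python
from collections import Counter

# Fourth description from each array in the question (index 3)
s1 = "26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41"
s2 = "330/29/22/103/331/314/68/80/49"

c1 = Counter(s1.split("/"))  # 18 word IDs, with 33 and 34 occurring twice
c2 = Counter(s2.split("/"))  # 9 word IDs

shared = c1 & c2    # per-ID minimum count
combined = c1 | c2  # per-ID maximum count

print(dict(shared))            # {'29': 1}
print(sum(combined.values()))  # 26
print(sum(shared.values()) / sum(combined.values()))  # 0.038461538461538464
```

Only the ID 29 is shared, and the combined bag holds 18 + 9 - 1 = 26 entries, so the score is 1/26, which matches the fourth value in the list.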
