numpy - Similarity Measure/Matrix for data (recommender system) - Python


I am new to machine learning and am trying out the following problem. The input is 2 arrays of descriptions of the same length, and the output is an array of similarity scores: the first string in the first array compared with the first string in the second array, and so on.

Each item in the array (a NumPy array) is a string description. I can write a function to find out how similar 2 strings are by calculating how many identical and co-occurring word IDs there are, and assign a score (one possible weight could be based on the frequency of co-occurrence vs. the sum of the frequencies of the individual word IDs). I would then apply the function to the 2 arrays to get an array of scores. Please let me know if there are other approaches I should consider as well. Thanks!

Data:

    array(['0/1/2/3/4/5/6/5/7/8/9/3/10',
           '11/12/13/14/15/15/16/17/12',
           '18/19/20/21/22/23/24/25',
           '26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41',
           '5/42/43/15/44/45/46/47/48/26/49/50/51/52/49/53/54/51/55/56/22',
           '57/58/59/60/61/49/62/23/57/58/63/57/58',
           '64/65/66/63/67/68/69',
           '70/71/72/73/74/75/76/77',
           '78/79/80/81/82/83/84/85/86/87/88/89/90/91',
           '33/34/92/93/94/95/85/96/97/98/99/60/85/86/100/101/102/103',
           '104/105/106/107/108/109/110/86/107/111/112/113/5/114/110/113/115/116',
           '117/118/119/120/121/12/122/123/124/125',
           '14/126/127/128/122/129/130/131/132/29/54/29/129/49/3/133/134/135/136',
           '137/138/139/140/141/142',
           '143/144/145/146/147/148/149/150/151/152/4/153/154/155/156/157/158/128/159',
           '160/161/162/163/131/2/164/165/166/167/168/169/49/170/109/171',
           '172/173/174/175/176/177/73/178/104/179/180/179/181/173',
           '182/144/183/179/73',
           '184/163/68/185/163/8/186/187/188/54/189/190/191',
           '181/192/0/1/193/194/22/195',
           '113/196/197/198/68/199/68/200/201/202/203/201',
           '204/205/206/207/208/209/68/200',
           '163/210/211/122/212/213/214/215/216/217/100/101/160/139/218/76/179/219',
           '220/221/222/223/5/68/224/225/54/225/226/227/5/221/222/223',
           '214/228/5/6/5/215/228/228/229',
           '230/231/232/233/122/215/128/214/128/234/234',
           '235/236/191/237/92/93/238/239',
           '13/14/44/44/240/241/242/49/54/243/244/245/55/56',
           '220/21/246/38/247/201/248/73/160/249/250/203/201',
           '214/49/251/252/253/254/255/256/257/258'],
          dtype='|S127')

    array(['151/308/309/310/311/215/312/160/313/214/49/12',
           '314/315/316/275/317/42/318/319/320/212/49/170/179/29/54/29/321/322/323',
           '324/325/62/220/326/194/327/328/218/76/241/329',
           '330/29/22/103/331/314/68/80/49',
           '78/332/85/96/97/227/333/4/334/188',
           '57/335/336/34/187/337/21/338/212/213/339/340',
           '341/342/167/343/8/254/154/61/344',
           '2/292/345/346/42/347/348/348/100/349/202/161/263',
           '283/39/312/350/26/351',
           '352/353/33/34/144/218/73/354/355',
           '137/356/357/358/357/359/22/73/170/87/88/78/123/360/361/53/362',
           '23/363/10/364/289/68/123/354/355',
           '188/28/365/149/366/98/367/368/369/370/371/372/368',
           '373/155/33/34/374/25/113/73',
           '104/375/81/82/168/169/81/82/18/19',
           '179/376/377/378/179/87/88/379/20',
           '380/85/381/333/382/215/128/383/384',
           '385/129/386/387/388',
           '389/280/26/27/390/391/302/392/393/165/394/254/302/214/217/395/396',
           '397/398/291/140/399/211/158/27/400',
           '401/402/92/93/68/80',
           '77/129/183/265/403/404/405/406/60/407/162/408/409/410/411/412/413/156',
           '129/295/90/259/38/39/119/414/415/416/14/318/417/418',
           '419/420/421/422/423/23/424/241/421/425/58',
           '426/244/427/5/428/49/76/429/430/431',
           '257/432/433/167/100/101/434/435/436',
           '437/167/438/344/356/170',
           '439/440/441/442/192/443/68/80/444/445/111',
           '446/312/23/447/448',
           '385/129/218/449/450/451/22/452/125/129/453/212/128/454/455/456/457/377'],
          dtype='|S127')

The following code should do what you need in Python 3.x:

    import numpy as np
    from collections import Counter

    def jaccardsim(c1, c2):
        # Multiset Jaccard: sum of per-token minimum counts
        # divided by sum of per-token maximum counts.
        cu = c1 | c2
        ci = c1 & c2
        sim = sum(ci.values()) / sum(cu.values())
        return sim

    def bytearraysim(b1, b2):
        # Decode each byte string and count its '/'-separated word IDs.
        ca = [Counter(b1[i].decode(encoding="utf-8", errors="strict").split("/"))
              for i in range(len(b1))]
        cb = [Counter(b2[i].decode(encoding="utf-8", errors="strict").split("/"))
              for i in range(len(b2))]

        # Assuming both b1 and b2 are of the same length
        csim = [jaccardsim(ca[i], cb[i]) for i in range(len(b1))]

        return csim  # list of similarities

The Jaccard similarity score is used in this implementation. You may use other scores, such as cosine or Hamming, to your liking.
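As one illustration of the alternative scores mentioned above, here is a minimal sketch of cosine similarity computed on the same kind of Counter objects. The name cosinesim and the two sample strings are made up for this example and are not part of the original answer; the function could be swapped in for jaccardsim inside bytearraysim.

```python
import math
from collections import Counter

def cosinesim(c1, c2):
    """Cosine similarity between two token-count Counters (hypothetical helper)."""
    # Dot product only needs the tokens that occur in both counters.
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot / (norm1 * norm2)

# Illustrative strings: two shared word IDs out of four in each.
x = Counter("33/34/92/93".split("/"))
y = Counter("92/93/68/80".split("/"))
print(cosinesim(x, y))  # 0.5
```

Unlike the Jaccard score, cosine similarity weights repeated word IDs by their squared counts, so descriptions that repeat the same shared ID several times score higher.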

Assuming the arrays are stored in variables a and b, calling bytearraysim(a, b) outputs the following similarity scores:

    [0.0, 0.0, 0.0, 0.038461538461538464, 0.0, 0.041666666666666664, 0.0, 0.0, 0.0, 0.08,
     0.0, 0.05555555555555555, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
     0.058823529411764705, 0.0, 0.0, 0.0, 0.05555555555555555, 0.0, 0.0, 0.0, 0.0, 0.0]
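To see where one of the non-zero scores comes from, the fourth pair (index 3) can be worked through by hand. This is only a sketch: the variable names s1, s2, shared and merged are chosen for this example, and the strings are copied from the data above as plain str rather than bytes.

```python
from collections import Counter

# Fourth pair (index 3) from the two arrays in the question.
s1 = "26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41"
s2 = "330/29/22/103/331/314/68/80/49"

c1 = Counter(s1.split("/"))
c2 = Counter(s2.split("/"))

shared = c1 & c2  # per-token minimum count (multiset intersection)
merged = c1 | c2  # per-token maximum count (multiset union)

print(sorted(shared.items()))                       # [('29', 1)]
print(sum(shared.values()), sum(merged.values()))   # 1 26
print(sum(shared.values()) / sum(merged.values()))  # 0.038461538461538464
```

Only word ID 29 is shared, and the union holds 18 + 9 - 1 = 26 counted tokens, which gives 1/26, the fourth value in the output list.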
